Statistical Analysis and ML of the Titanic

Author

Fabrizio Torrico

Published

May 7, 2025


This kernel is for aspiring data scientists who want to learn and to review what they already know. It presents a detailed statistical analysis of the Titanic dataset along with machine learning model implementations. I am excited to share my first kernel with the Kaggle community. As I continue on this journey and learn new topics, I will incorporate them in each update, so check back, and please leave a comment if you have suggestions for improving this kernel. As for the content, I will use in-depth visualizations to explain the data, and machine learning classifiers to predict passenger survival status.


Kernel Goals


There are three primary goals of this kernel.

  • Do a statistical analysis of why some groups of passengers survived at higher rates than others.
  • Do an exploratory data analysis (EDA) of the Titanic data with visualizations and storytelling.
  • Predict: use machine learning classification models to predict each passenger’s chance of survival.

P.S. If you want to learn more about regression models, try this kernel.

Part 1: Importing Necessary Libraries and datasets


1a. Loading libraries

Python is a fantastic language with a vibrant community that produces many amazing libraries. I am not a big fan of importing everything at once for the newcomers. So, I am going to introduce a few necessary libraries for now, and as we go on, we will keep unboxing new libraries when it seems appropriate.

using Pkg
Pkg.activate(".")
Pkg.add(["IJulia", "DataFrames", "CSV", "CairoMakie", "StatsBase",
         "Statistics", "MLJ", "MLJModels", "MLJBase", "HypothesisTests",
         "Distributions", "Missings", "CategoricalArrays", "AlgebraOfGraphics", "Chain"])
  Activating project at `C:\Users\Fabrizio\Documents\Projects\Estudio-IA\dia10\julia-titanic-notebook`
   Resolving package versions...
  No Changes to `C:\Users\Fabrizio\Documents\Projects\Estudio-IA\dia10\julia-titanic-notebook\Project.toml`
  No Changes to `C:\Users\Fabrizio\Documents\Projects\Estudio-IA\dia10\julia-titanic-notebook\Manifest.toml`

::: {#4 .cell _cell_guid=‘80643cb5-64f3-4180-92a9-2f8e83263ac6’ _kg_hide-input=‘true’ _uuid=‘33d54abf387474bce3017f1fc3832493355010c0’ tags=‘[]’ execution_count=1}

import DataFrames as DF
import CSV
import CairoMakie as Makie
import AlgebraOfGraphics as AoG
import Statistics as Stats
import StatsBase
import Chain: @chain
import Random: shuffle
import IJulia

:::

readdir("./input/")
3-element Vector{String}:
 "gender_submission.csv"
 "test.csv"
 "train.csv"

1b. Loading Datasets


After loading the necessary modules, we need to import the datasets. Many of the business problems usually come with a tremendous amount of messy data. We extract those data from many sources. I am hoping to write about that in a different kernel. For now, we are going to work with a less complicated and quite popular machine learning dataset.

## Importing the datasets
using CSV

train = CSV.read("./input/train.csv", DF.DataFrame)
test = CSV.read("./input/test.csv", DF.DataFrame);

You are probably wondering why there are two datasets, and why I have named them “train” and “test”. To explain that, I am going to give you an overall picture of the supervised machine learning process.

“Machine Learning” is simply “Machine” and “Learning”. Nothing more and nothing less. In a supervised machine learning process, we give a machine/computer/model specific inputs or data (text/number/image/audio) to learn from; in other words, we train the machine to learn certain aspects based on the data and the known output. Now, how can we determine that the machine is actually learning what we are trying to teach? That is where the test set comes into play. We withhold part of the data for which we know the output/result of each data point, and we use this data to test the trained models. We then compare the predicted and known outcomes to determine the performance of the algorithms. If you are a bit confused, that’s okay. I will explain more as we keep reading. Let’s take a look at sample datasets.
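The idea of withholding part of the labelled data can be sketched in a few lines. This is only an illustration, not the notebook's method: `holdout_split`, `test_fraction`, and `seed` are hypothetical names, and the 80/20 ratio is an assumption. Kaggle's test.csv comes without labels, so a local split like this is one way to check a model before submitting.

```julia
using Random  # standard library; provides shuffle and MersenneTwister

# Hypothetical helper (not part of the notebook): split the row indices 1:n
# into a training part and a held-out part.
function holdout_split(n::Integer; test_fraction::Float64 = 0.2, seed::Integer = 42)
    rng = MersenneTwister(seed)            # fixed seed so the split is reproducible
    idx = shuffle(rng, 1:n)                # random permutation of the row indices
    n_test = round(Int, test_fraction * n)
    return idx[n_test+1:end], idx[1:n_test]    # (training indices, held-out indices)
end

# 891 is the number of rows in train.csv
train_idx, test_idx = holdout_split(891)
println(length(train_idx), " training rows, ", length(test_idx), " held-out rows")
```

A model would then be fitted on the rows in `train_idx` only and scored on the rows in `test_idx`, whose true outcomes are known.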

DF.first(train, 5)
5×12 DataFrame
Row PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
Int64 Int64 Int64 String String7 Float64? Int64 Int64 String31 Float64 String15? String1?
1 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.25 missing S
2 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female 38.0 1 0 PC 17599 71.2833 C85 C
3 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.925 missing S
4 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1 C123 S
5 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.05 missing S
@chain train begin
    DF.dropmissing(:Age) # Drop rows with missing Age
    DF.groupby(:Sex)
    DF.combine(:Age => minimum => :MinAge)
end
2×2 DataFrame
 Row │ Sex      MinAge
     │ String7  Float64
─────┼──────────────────
   1 │ male        0.42
   2 │ female      0.75
DF.describe(train, :eltype)
12×2 DataFrame
 Row │ variable     eltype
     │ Symbol       Type
─────┼───────────────────────────────────────
   1 │ PassengerId  Int64
   2 │ Survived     Int64
   3 │ Pclass       Int64
   4 │ Name         String
   5 │ Sex          String7
   6 │ Age          Union{Missing, Float64}
   7 │ SibSp        Int64
   8 │ Parch        Int64
   9 │ Ticket       String31
  10 │ Fare         Float64
  11 │ Cabin        Union{Missing, String15}
  12 │ Embarked     Union{Missing, String1}

1c. A Glimpse of the Datasets.


Train Set

DF.first(train[shuffle(1:DF.nrow(train))[1:5], :], 5)
5×12 DataFrame
Row PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
Int64 Int64 Int64 String String7 Float64? Int64 Int64 String31 Float64 String15? String1?
1 575 0 3 Rush, Mr. Alfred George John male 16.0 0 0 A/4. 20589 8.05 missing S
2 439 0 1 Fortune, Mr. Mark male 64.0 1 4 19950 263.0 C23 C25 C27 S
3 111 0 1 Porter, Mr. Walter Chamberlain male 47.0 0 0 110465 52.0 C110 S
4 570 1 3 Jonsson, Mr. Carl male 32.0 0 0 350417 7.8542 missing S
5 802 1 2 Collyer, Mrs. Harvey (Charlotte Annie Tate) female 31.0 1 1 C.A. 31921 26.25 missing S

Test Set

DF.first(test[shuffle(1:DF.nrow(test))[1:5], :], 5)
5×11 DataFrame
Row PassengerId Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
Int64 Int64 String String7 Float64? Int64 Int64 String31 Float64? String15? String1
1 1261 2 Pallas y Castello, Mr. Emilio male 29.0 0 0 SC/PARIS 2147 13.8583 missing C
2 1238 2 Botsford, Mr. William Hull male 26.0 0 0 237670 13.0 missing S
3 986 1 Birnbaum, Mr. Jakob male 25.0 0 0 13905 26.0 missing C
4 1007 3 Chronopoulos, Mr. Demetrios male 18.0 1 0 2680 14.4542 missing C
5 900 3 Abrahim, Mrs. Joseph (Sophie Halaut Easu) female 18.0 0 0 2657 7.2292 missing C

This is a sample of the train and test datasets. Let’s find out a bit more about them.

println("The shape of the train data is (row, column): $(size(train))")
println("Train dataset info:")
DF.describe(train)


println("The shape of the test data is (row, column): $(size(test))")
println("Test dataset info:")
DF.describe(test)
The shape of the train data is (row, column): (891, 12)
Train dataset info:
The shape of the test data is (row, column): (418, 11)
Test dataset info:
11×7 DataFrame
 Row │ variable     mean      min                            median   max                                nmissing  eltype
     │ Symbol       Union…    Any                            Union…   Any                                Int64     Type
─────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │ PassengerId  1100.5    892                            1100.5   1309                                      0  Int64
   2 │ Pclass       2.26555   1                              3.0      3                                         0  Int64
   3 │ Name                   Abbott, Master. Eugene Joseph           van Billiard, Master. Walter John         0  String
   4 │ Sex                    female                                  male                                      0  String7
   5 │ Age          30.2726   0.17                           27.0     76.0                                     86  Union{Missing, Float64}
   6 │ SibSp        0.447368  0                              0.0      8                                         0  Int64
   7 │ Parch        0.392344  0                              0.0      9                                         0  Int64
   8 │ Ticket                 110469                                  W.E.P. 5734                               0  String31
   9 │ Fare         35.6272   0.0                            14.4542  512.329                                   1  Union{Missing, Float64}
  10 │ Cabin                  A11                                     G6                                      327  Union{Missing, String15}
  11 │ Embarked               C                                       S                                         0  String1

1d. About This Dataset


The data has been split into two groups:

  • training set (train.csv)
  • test set (test.csv)

The training set includes our target variable (dependent variable), passenger survival status (also known as the ground truth from the Titanic tragedy), along with independent features such as sex, age, fare, and Pclass.

The test set should be used to see how well our model performs on unseen data. By unseen data, we mean data that played no part in building the machine learning models. We do not want to use any part of the test data to modify our algorithms, which is why we clean the test data and the train data separately. The test set does not provide passenger survival status; we are going to use our model to predict it.

Now let’s go through the features and describe them briefly. There are a few different types of variables:


Categorical:

  • Nominal (variables that have two or more categories with no intrinsic order)
    • Cabin
    • Embarked (port of embarkation): C (Cherbourg), Q (Queenstown), S (Southampton)

  • Dichotomous (a nominal variable with only two categories)
    • Sex: female, male

  • Ordinal (variables that have two or more categories, just like nominal variables, except that the categories can also be ordered or ranked)
    • Pclass (a proxy for socio-economic status, SES): 1 (upper), 2 (middle), 3 (lower)


Numeric:

  • Discrete
    • PassengerId (unique identifying number for each passenger)
    • SibSp
    • Parch
    • Survived (our outcome or dependent variable): 0, 1
  • Continuous
    • Age
    • Fare

Text Variable

  • Ticket (Ticket number for passenger.)
  • Name( Name of the passenger.)
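The variable types above matter because each is usually turned into numbers in a different way. The following is a minimal sketch in plain Julia on made-up values in the style of the dataset; the names `pclass_rank`, `is_female`, and `embarked_onehot` are illustrative and do not appear elsewhere in this kernel.

```julia
# Toy values in the style of the dataset (illustrative, not read from train.csv)
pclass   = [3, 1, 3, 1, 2]
embarked = ["S", "C", "S", "S", "Q"]
sex      = ["male", "female", "female", "female", "male"]

# Ordinal: the integer codes already carry the order 1 (upper) < 2 (middle) < 3 (lower)
pclass_rank = pclass

# Dichotomous: a single 0/1 indicator is enough
is_female = Int.(sex .== "female")

# Nominal: one indicator column per category, so no artificial order is implied
ports = ["C", "Q", "S"]
embarked_onehot = [Int(e == p) for e in embarked, p in ports]   # 5×3 matrix

println(is_female)
println(embarked_onehot)
```

Encoding Embarked as 1, 2, 3 instead would suggest an ordering between ports that the data does not have, which is why the sketch uses separate indicator columns.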

1e. Tableau Visualization of the Data


I have incorporated a tableau visualization below of the training data. This visualization…

  • is for us to have an overview and play around with the dataset.
  • is done without making any changes(including Null values) to any features of the dataset.

Let’s get a better perspective of the dataset through this visualization.

We want to see how the left vertical bar changes when we filter on unique values of certain features. We can use multiple filters to see if there are any correlations among them. For example, if we click on the Upper and Female tabs, we would see that green dominates the bar, with a ratio of 91:3 between female passengers who survived and those who did not; a 97% survival rate for these females. We can reset the filters by clicking anywhere in the white space. The age distribution chart on top provides some more information, such as the age range of those three unlucky females, since the red color gives away the ones who did not survive. If you would like to check out some of my other Tableau charts, please click here.

Part 2: Overview and Cleaning the Data


2a. Overview

Datasets in the real world are often messy. However, this dataset is almost clean. Let’s analyze it and see what we have here.

::: {#22 .cell _cell_guid=‘bf19c831-fbe0-49b6-8bf8-d7db118f40b1’ _kg_hide-input=‘true’ _uuid=‘5a0593fb4564f0284ca7fdf5c006020cb288db95’ execution=‘{“iopub.execute_input”:“2021-06-26T16:35:08.956119Z”,“iopub.status.busy”:“2021-06-26T16:35:08.955538Z”,“iopub.status.idle”:“2021-06-26T16:35:08.973222Z”,“shell.execute_reply”:“2021-06-26T16:35:08.972151Z”,“shell.execute_reply.started”:“2021-06-26T16:35:08.956072Z”}’ execution_count=1}

DF.describe(train, :nmissing, :eltype)
12×3 DataFrame
 Row │ variable     nmissing  eltype
     │ Symbol       Int64     Type
─────┼─────────────────────────────────────────────────
   1 │ PassengerId         0  Int64
   2 │ Survived            0  Int64
   3 │ Pclass              0  Int64
   4 │ Name                0  String
   5 │ Sex                 0  String7
   6 │ Age               177  Union{Missing, Float64}
   7 │ SibSp               0  Int64
   8 │ Parch               0  Int64
   9 │ Ticket              0  String31
  10 │ Fare                0  Float64
  11 │ Cabin             687  Union{Missing, String15}
  12 │ Embarked            2  Union{Missing, String1}

:::

It looks like the columns do not all have the same number of entries, and they hold several different types of variables. This can happen for the following reasons…

  • We may have missing values in our features.
  • We may have categorical features.
  • We may have alphanumerical or/and text features.

2b. Dealing with Missing values


Missing values in train dataset.

::: {#24 .cell _kg_hide-input=‘true’ execution=‘{“iopub.execute_input”:“2021-06-26T16:35:08.975451Z”,“iopub.status.busy”:“2021-06-26T16:35:08.974927Z”,“iopub.status.idle”:“2021-06-26T16:35:08.98326Z”,“shell.execute_reply”:“2021-06-26T16:35:08.982644Z”,“shell.execute_reply.started”:“2021-06-26T16:35:08.975205Z”}’ execution_count=1}

"""Return the total and the percentage of missing values for each column of `df`."""
function missing_percentage(df::DF.DataFrame)
    missing_counts = [count(ismissing, df[!, col]) for col in DF.names(df)]
    missing_pct = round.(missing_counts ./ DF.nrow(df) .* 100, digits=2)

    # Create result DataFrame
    result = DF.DataFrame(
        Column = DF.names(df),
        Total = missing_counts,
        Percent = missing_pct
    )

    # Sort by total missing values (descending)
    return DF.sort(result, :Total, rev=true)
end
missing_percentage (generic function with 1 method)

:::

missing_percentage(train)
12×3 DataFrame
 Row │ Column       Total  Percent
     │ String       Int64  Float64
─────┼─────────────────────────────
   1 │ Cabin          687    77.1
   2 │ Age            177    19.87
   3 │ Embarked         2     0.22
   4 │ PassengerId      0     0.0
   5 │ Survived         0     0.0
   6 │ Pclass           0     0.0
   7 │ Name             0     0.0
   8 │ Sex              0     0.0
   9 │ SibSp            0     0.0
  10 │ Parch            0     0.0
  11 │ Ticket           0     0.0
  12 │ Fare             0     0.0

Missing values in test set.

::: {#28 .cell _cell_guid=‘073ef91b-e401-47a1-9b0a-d08ad710abce’ _kg_hide-input=‘true’ _uuid=‘1ec1de271f57c9435ce111261ba08c5d6e34dbcb’ execution=‘{“iopub.execute_input”:“2021-06-26T16:35:09.208229Z”,“iopub.status.busy”:“2021-06-26T16:35:09.207968Z”,“iopub.status.idle”:“2021-06-26T16:35:09.221423Z”,“shell.execute_reply”:“2021-06-26T16:35:09.220732Z”,“shell.execute_reply.started”:“2021-06-26T16:35:09.208186Z”}’ execution_count=1}

missing_percentage(test)
11×3 DataFrame
 Row │ Column       Total  Percent
     │ String       Int64  Float64
─────┼─────────────────────────────
   1 │ Cabin          327    78.23
   2 │ Age             86    20.57
   3 │ Fare             1     0.24
   4 │ PassengerId      0     0.0
   5 │ Pclass           0     0.0
   6 │ Name             0     0.0
   7 │ Sex              0     0.0
   8 │ SibSp            0     0.0
   9 │ Parch            0     0.0
  10 │ Ticket           0     0.0
  11 │ Embarked         0     0.0

:::

We see that both the train and test datasets have missing values. Let’s make an effort to fill these missing values, starting with the “Embarked” feature.

Embarked feature


::: {#30 .cell _kg_hide-input=‘true’ execution=‘{“iopub.execute_input”:“2021-06-26T16:35:09.223175Z”,“iopub.status.busy”:“2021-06-26T16:35:09.222681Z”,“iopub.status.idle”:“2021-06-26T16:35:09.230671Z”,“shell.execute_reply”:“2021-06-26T16:35:09.229793Z”,“shell.execute_reply.started”:“2021-06-26T16:35:09.223128Z”}’ execution_count=1}

"""Return the count and percentage of each value (including missing) in column `feature` of `df`."""
function percent_value_counts(df::DF.DataFrame, feature::Symbol)

    # Count values including missing
    counts = DF.combine(DF.groupby(df, feature), DF.nrow => :Total)

    # Calculate percentages
    counts.Percent = round.(counts.Total ./ DF.nrow(df) .* 100, digits=2)

    # Sort by total count (descending)
    return DF.sort(counts, :Total, rev=true)
end
percent_value_counts (generic function with 1 method)

:::

::: {#32 .cell _kg_hide-input=‘true’ execution=‘{“iopub.execute_input”:“2021-06-26T16:35:09.236974Z”,“iopub.status.busy”:“2021-06-26T16:35:09.236548Z”,“iopub.status.idle”:“2021-06-26T16:35:09.254321Z”,“shell.execute_reply”:“2021-06-26T16:35:09.253654Z”,“shell.execute_reply.started”:“2021-06-26T16:35:09.236929Z”}’ execution_count=1}

percent_value_counts(train, :Embarked)
4×3 DataFrame
 Row │ Embarked  Total  Percent
     │ String1?  Int64  Float64
─────┼──────────────────────────
   1 │ S           644    72.28
   2 │ C           168    18.86
   3 │ Q            77     8.64
   4 │ missing       2     0.22

:::

It looks like there are only two null values (~0.22%) in the Embarked feature. We could replace these with the mode value “S”. However, let’s dig a little deeper.
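For reference, mode imputation itself can be sketched in plain Julia. The vector below is rebuilt from the value counts printed above; `simple_mode` is a hypothetical helper written for this illustration, not a function used elsewhere in this kernel.

```julia
# Rebuild an Embarked-like column from the counts printed above
embarked = vcat(fill("S", 644), fill("C", 168), fill("Q", 77), fill(missing, 2))

# Hypothetical helper: most frequent non-missing value
function simple_mode(values)
    counts = Dict{String, Int}()
    for v in skipmissing(values)
        counts[v] = get(counts, v, 0) + 1
    end
    return findmax(counts)[2]    # findmax on a Dict returns (largest count, its key)
end

# Replace every missing entry with the mode
filled = coalesce.(embarked, simple_mode(embarked))
println("mode = ", simple_mode(embarked), ", missing left = ", count(ismissing, filled))
```

This is the quick default; the rest of this subsection checks whether the other columns point to a better value for these two rows.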

Let’s see what those two null values are.

::: {#34 .cell _cell_guid=‘000ebdd7-ff57-48d9-91bf-a29ba79f1a1c’ _kg_hide-input=‘true’ _uuid=‘6b9cb050e9dae424bb738ba9cdf3c84715887fa3’ execution=‘{“iopub.execute_input”:“2021-06-26T16:35:09.276102Z”,“iopub.status.busy”:“2021-06-26T16:35:09.275649Z”,“iopub.status.idle”:“2021-06-26T16:35:09.292037Z”,“shell.execute_reply”:“2021-06-26T16:35:09.291163Z”,“shell.execute_reply.started”:“2021-06-26T16:35:09.275879Z”}’ execution_count=1}

train[ismissing.(train.Embarked), :]
2×12 DataFrame
Row PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
Int64 Int64 Int64 String String7 Float64? Int64 Int64 String31 Float64 String15? String1?
1 62 1 1 Icard, Miss. Amelie female 38.0 0 0 113572 80.0 B28 missing
2 830 1 1 Stone, Mrs. George Nelson (Martha Evelyn) female 62.0 0 0 113572 80.0 B28 missing

:::

We may be able to fill these two missing values by looking at the other independent variables of the two rows. Both passengers paid a fare of $80, are in Pclass 1, and are female. Let’s see how Fare is distributed across the Pclass and Embarked values.

::: {#36 .cell _cell_guid=‘bf257322-0c9c-4fc5-8790-87d8c94ad28a’ _kg_hide-input=‘true’ _uuid=‘ad15052fe6cebe37161c6e01e33a5c083dc2b558’ execution=‘{“iopub.execute_input”:“2021-06-26T16:35:09.293919Z”,“iopub.status.busy”:“2021-06-26T16:35:09.293564Z”,“iopub.status.idle”:“2021-06-26T16:35:09.866643Z”,“shell.execute_reply”:“2021-06-26T16:35:09.865701Z”,“shell.execute_reply.started”:“2021-06-26T16:35:09.293817Z”}’ execution_count=1}

fig = Makie.Figure()

# Prepare data for plotting
train_clean = DF.dropmissing(train, [:Embarked, :Fare, :Pclass])
test_clean = DF.dropmissing(test, [:Embarked, :Fare, :Pclass])

# Create mapping for embarked ports to numbers
unique_categories = unique(train_clean.Embarked)
category_to_index = Dict(category => i for (i, category) in enumerate(unique_categories))
# Convert categorical to numeric
train_clean.Embarked_num = [category_to_index[port] for port in train_clean.Embarked]
test_clean.Embarked_num = [category_to_index[port] for port in test_clean.Embarked]

# Training set boxplot
ax1 = Makie.Axis(fig[1, 1],
    title = "Training Set",
    xlabel = "Embarked",
    ylabel = "Fare",
    xticks = (1:3, unique_categories)
)

ax2 = Makie.Axis(fig[1, 2],
    title = "Test Set",
    xlabel = "Embarked",
    ylabel = "Fare",
    xticks = (1:3, unique_categories)
)

Makie.boxplot!(ax2, test_clean.Embarked_num, test_clean.Fare,
           dodge = test_clean.Pclass,
           color = test_clean.Pclass)
Makie.boxplot!(ax1, train_clean.Embarked_num, train_clean.Fare,
           dodge = train_clean.Pclass,
           color = train_clean.Pclass)

fig

:::

Here, in both the training set and the test set, the Pclass 1 group whose typical fare is closest to $80 is the one that embarked at C. So, let’s fill in the missing values with “C”.

::: {#38 .cell _cell_guid=‘2f5f3c63-d22c-483c-a688-a5ec2a477330’ _kg_hide-input=‘true’ _uuid=‘52e51ada5dfeb700bf775c66e9307d6d1e2233de’ execution=‘{“iopub.execute_input”:“2021-06-26T16:35:09.868523Z”,“iopub.status.busy”:“2021-06-26T16:35:09.868016Z”,“iopub.status.idle”:“2021-06-26T16:35:09.874135Z”,“shell.execute_reply”:“2021-06-26T16:35:09.873022Z”,“shell.execute_reply.started”:“2021-06-26T16:35:09.868249Z”}’ scrolled=‘true’ execution_count=1}

## Replacing the null values in the Embarked column with "C" (chosen from the fare comparison above, not the mode "S").
train.Embarked = coalesce.(train.Embarked, "C");

:::
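The reasoning behind choosing “C” can also be written as a small computation: take the typical first-class fare per port and pick the port closest to the fare the two passengers paid. The fares below are made up for illustration and are not taken from train.csv; `closest_port` is a hypothetical helper, and using the median is an assumption that mirrors the center line of a boxplot.

```julia
using Statistics  # standard library; provides median

# Made-up first-class fares per port (port, fare), for illustration only
fares = [("S", 52.0), ("S", 60.0), ("S", 26.55),
         ("C", 76.7), ("C", 83.2), ("C", 79.2),
         ("Q", 90.0)]

# Hypothetical helper: port whose median fare is closest to a target fare
function closest_port(fares, target)
    ports = unique(first.(fares))
    med = Dict(p => median([f for (q, f) in fares if q == p]) for p in ports)
    return argmin(p -> abs(med[p] - target), ports)   # needs Julia 1.7 or later
end

println(closest_port(fares, 80.0))
```

With the real data, the same comparison would be run on the Pclass 1 rows of the training set.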

Cabin Feature


::: {#40 .cell _cell_guid=‘e76cd770-b498-4444-b47a-4ac6ae63193b’ _kg_hide-input=‘true’ _uuid=‘b809a788784e2fb443457d7ef4ca17a896bf58b4’ execution=‘{“iopub.execute_input”:“2021-06-26T16:35:09.876171Z”,“iopub.status.busy”:“2021-06-26T16:35:09.875621Z”,“iopub.status.idle”:“2021-06-26T16:35:09.886193Z”,“shell.execute_reply”:“2021-06-26T16:35:09.885088Z”,“shell.execute_reply.started”:“2021-06-26T16:35:09.875859Z”}’ scrolled=‘true’ execution_count=1}

println("Train Cabin missing: $(count(ismissing, train.Cabin) / DF.nrow(train))")
println("Test Cabin missing: $(count(ismissing, test.Cabin) / DF.nrow(test))")
Train Cabin missing: 0.7710437710437711
Test Cabin missing: 0.7822966507177034

:::

Approximately 77% of the Cabin feature is missing in the training data and 78% in the test data. We have two choices:

  • we can either get rid of the whole feature, or
  • we can brainstorm a little and find an appropriate way to put it to use. For example, we may say that passengers with a cabin record had a higher socio-economic status than others. We may also say that passengers with a cabin record were more likely to be taken into consideration when loading the lifeboats.
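One simple way to act on the second option is an indicator for whether a cabin was recorded at all. The sketch below uses the Cabin values of the first five training rows shown earlier; `has_cabin` is a hypothetical feature name and is not the approach this kernel goes on to use.

```julia
# Cabin values of the first five training rows shown earlier
cabin = [missing, "C85", missing, "C123", missing]

# 1 if a cabin was recorded, 0 otherwise (hypothetical feature name)
has_cabin = Int.(.!ismissing.(cabin))

println(has_cabin)
```

Such a column keeps the signal that a record exists without requiring any guess about which cabin a passenger had.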

Let’s combine the train and test data first and, for now, assign “N” to all the null values.

::: {#42 .cell _kg_hide-input=‘true’ _uuid=‘8ff7b4f88285bc65d72063d7fdf8a09a5acb62d3’ execution=‘{“iopub.execute_input”:“2021-06-26T16:35:09.888377Z”,“iopub.status.busy”:“2021-06-26T16:35:09.88784Z”,“iopub.status.idle”:“2021-06-26T16:35:09.902296Z”,“shell.execute_reply”:“2021-06-26T16:35:09.901697Z”,“shell.execute_reply.started”:“2021-06-26T16:35:09.888114Z”}’ execution_count=1}

survivors = train.Survived
DF.select!(train, DF.Not(:Survived))  # Remove Survived column
all_data = vcat(train, test)

all_data.Cabin = coalesce.(all_data.Cabin, "N");

:::

All the cabin names start with an English letter followed by digits. It seems that some passengers booked multiple cabins under their name, likely because many of them travelled with family. However, they all seem to have booked under the same letter followed by different numbers. The letter appears to be more significant than the numbers, so we can group the cabins by the first letter of the cabin name.
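A short sketch of that grouping on a few cabin strings that appear in the samples above, plus the “N” placeholder. The helper `same_letter` is hypothetical; it is included so the "same letter" observation can be checked on the full column, where a few entries may well mix letters.

```julia
# Cabin strings from the samples shown earlier, plus the "N" placeholder
cabins = ["C85", "C123", "C23 C25 C27", "B28", "N"]

# Keep only the leading letter; a multi-cabin booking collapses to one letter
deck = [string(first(c)) for c in cabins]

# Hypothetical check: do all cabins in one booking share the same letter?
same_letter(c) = length(unique(first.(split(c)))) == 1

println(deck)
println(all(same_letter, cabins))
```

Taking only the first character means that a booking with mixed letters, if any exist, would be labelled by its first cabin.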

::: {#44 .cell _cell_guid=‘87995359-8a77-4e38-b8bb-e9b4bdeb17ed’ _kg_hide-input=‘true’ _uuid=‘c1e9e06eb7f2a6eeb1a6d69f000217e7de7d5f25’ execution=‘{“iopub.execute_input”:“2021-06-26T16:35:09.904181Z”,“iopub.status.busy”:“2021-06-26T16:35:09.903766Z”,“iopub.status.idle”:“2021-06-26T16:35:09.909654Z”,“shell.execute_reply”:“2021-06-26T16:35:09.908573Z”,“shell.execute_reply.started”:“2021-06-26T16:35:09.904014Z”}’ execution_count=1}

all_data.Cabin = [string(cabin[1]) for cabin in all_data.Cabin];

:::

Now let’s look at the value counts of the cabin features and see how it looks.

::: {#46 .cell _kg_hide-input=‘true’ execution=‘{“iopub.execute_input”:“2021-06-26T16:35:09.91156Z”,“iopub.status.busy”:“2021-06-26T16:35:09.911098Z”,“iopub.status.idle”:“2021-06-26T16:35:09.928945Z”,“shell.execute_reply”:“2021-06-26T16:35:09.928025Z”,“shell.execute_reply.started”:“2021-06-26T16:35:09.911398Z”}’ execution_count=1}

percent_value_counts(all_data, :Cabin)
9×3 DataFrame
 Row │ Cabin   Total  Percent
     │ String  Int64  Float64
─────┼────────────────────────
   1 │ N        1014    77.46
   2 │ C          94     7.18
   3 │ B          65     4.97
   4 │ D          46     3.51
   5 │ E          41     3.13
   6 │ A          22     1.68
   7 │ F          21     1.6
   8 │ G           5     0.38
   9 │ T           1     0.08

:::

So, we still haven’t done any effective work to replace the null values. Let’s stop for a second and think through how we can take advantage of some of the other features.

  • We can use the average of the Fare column. The groupby function from DataFrames.jl gives us the mean fare for each cabin letter.

::: {#48 .cell _kg_hide-input=‘true’ execution=‘{“iopub.execute_input”:“2021-06-26T16:35:09.930774Z”,“iopub.status.busy”:“2021-06-26T16:35:09.930283Z”,“iopub.status.idle”:“2021-06-26T16:35:09.942122Z”,“shell.execute_reply”:“2021-06-26T16:35:09.941067Z”,“shell.execute_reply.started”:“2021-06-26T16:35:09.930532Z”}’ execution_count=1}

@chain all_data begin
    DF.dropmissing(:Fare)
    DF.groupby(:Cabin)
    DF.combine(:Fare => Stats.mean => :Mean_Fare)
    DF.sort(:Mean_Fare)
end
9×2 DataFrame
 Row │ Cabin   Mean_Fare
     │ String  Float64
─────┼───────────────────
   1 │ G         14.205
   2 │ F         18.0794
   3 │ N         19.1327
   4 │ T         35.5
   5 │ A         41.2443
   6 │ D         53.0073
   7 │ E         54.5646
   8 │ C        107.927
   9 │ B        122.383

:::

Now, these means can help us assign the unknown cabins if we compare the fare in each unknown-cabin row with the means above. Let’s write a simple function that assigns a cabin letter based on the means.

::: {#50 .cell _kg_hide-input=‘true’ _uuid=‘a466da29f1989fa983147faf9e63d18783468567’ execution=‘{“iopub.execute_input”:“2021-06-26T16:35:09.943855Z”,“iopub.status.busy”:“2021-06-26T16:35:09.943364Z”,“iopub.status.idle”:“2021-06-26T16:35:09.952677Z”,“shell.execute_reply”:“2021-06-26T16:35:09.952057Z”,“shell.execute_reply.started”:“2021-06-26T16:35:09.943627Z”}’ execution_count=1}

"""Estimate the cabin letter from the fare, using fare bands based on the mean fare of each cabin letter."""
function cabin_estimator(fare::Union{Float64, Missing})
    # Handle missing values
    if ismissing(fare)
        return "N"  # Default cabin for missing fare
    end

    # Each band covers lower bound <= fare < upper bound
    if fare < 16
        return "G"
    elseif 16 <= fare < 27
        return "F"
    elseif 27 <= fare < 38
        return "T"
    elseif 38 <= fare < 47
        return "A"
    elseif 47 <= fare < 53
        return "E"
    elseif 53 <= fare < 54
        return "D"
    elseif 54 <= fare < 116
        return "C"
    else
        return "B"
    end
end
cabin_estimator (generic function with 1 method)

:::
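The same fare bands can also be expressed as a lookup table, which keeps the thresholds in one place. This is an alternative sketch, not the function used in this kernel: `fare_bounds`, `deck_labels`, and `cabin_from_fare` are hypothetical names, and the bounds are copied from cabin_estimator.

```julia
# Upper bounds of the fare bands and the letter assigned to each band,
# copied from cabin_estimator; fares of 116 or more map to "B".
fare_bounds = [16, 27, 38, 47, 53, 54, 116]
deck_labels = ["G", "F", "T", "A", "E", "D", "C", "B"]

# Hypothetical alternative to the if/elseif chain
function cabin_from_fare(fare)
    ismissing(fare) && return "N"      # same default as cabin_estimator
    # The number of bounds that are <= fare selects the band
    return deck_labels[searchsortedlast(fare_bounds, fare) + 1]
end

for fare in (7.25, 16.0, 53.1, 71.2833, 263.0, missing)
    println(fare, " => ", cabin_from_fare(fare))
end
```

Because `searchsortedlast` counts the bounds that are less than or equal to the fare, a fare exactly on a bound falls into the higher band, matching the `lower <= fare < upper` conditions above.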

Let’s apply the cabin_estimator function to each unknown cabin (rows where Cabin was null). Once that is done, we will separate train and test again to continue towards machine learning modeling.

with_N = all_data[all_data.Cabin .== "N", :]
without_N = all_data[all_data.Cabin .!= "N", :];

::: {#54 .cell _kg_hide-input=‘true’ _uuid=‘1c646b64c6e062656e5f727d5499266f847c4832’ execution=‘{“iopub.execute_input”:“2021-06-26T16:35:09.965179Z”,“iopub.status.busy”:“2021-06-26T16:35:09.96464Z”,“iopub.status.idle”:“2021-06-26T16:35:09.981536Z”,“shell.execute_reply”:“2021-06-26T16:35:09.980705Z”,“shell.execute_reply.started”:“2021-06-26T16:35:09.964885Z”}’ execution_count=1}

with_N.Cabin = cabin_estimator.(with_N.Fare)

# Combine back together
all_data = vcat(with_N, without_N)

# Sort by PassengerId
DF.sort!(all_data, :PassengerId)

# Separate train and test
train = all_data[1:891, :]
test = all_data[892:end, :]

# Add back survival information
train.Survived = survivors;

:::

Fare Feature


If you have paid attention so far, you know that there is only one missing value in the Fare column. Let’s look at it.

test[ismissing.(test.Fare), :]
1×11 DataFrame
Row PassengerId Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
Int64 Int64 String String7 Float64? Int64 Int64 String31 Float64? String Abstract…
1 1044 3 Storey, Mr. Thomas male 60.5 0 0 3701 missing N S

Here, we could take the average of the whole Fare column to fill in the missing value. However, for the sake of learning and practicing, we will try something else: we take the average fare of the passengers whose Pclass is 3, Sex is male, and Embarked is S.

::: {#58 .cell _cell_guid=‘e742aa76-b6f8-4882-8bd6-aa10b96f06aa’ _kg_hide-input=‘true’ _uuid=‘f1dc8c6c33ba7df075ee608467be2a83dc1764fd’ execution=‘{“iopub.execute_input”:“2021-06-26T16:35:10.002749Z”,“iopub.status.busy”:“2021-06-26T16:35:10.002232Z”,“iopub.status.idle”:“2021-06-26T16:35:10.012662Z”,“shell.execute_reply”:“2021-06-26T16:35:10.011431Z”,“shell.execute_reply.started”:“2021-06-26T16:35:10.00248Z”}’ execution_count=1}

missing_value = @chain test begin
    DF.subset(:Pclass => x -> x .== 3, :Embarked => x -> x .== "S", :Sex => x -> x .== "male")
    _.Fare
    skipmissing
    Stats.mean
end

test.Fare = coalesce.(test.Fare, missing_value);

:::
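The same group-mean fill can be sketched without DataFrames on a handful of toy rows. The values below are illustrative rather than taken from test.csv, and `group_fares` and `fill_value` are hypothetical names.

```julia
using Statistics  # standard library; provides mean

# Toy rows (pclass, sex, embarked, fare); values are illustrative
rows = [
    (3, "male",   "S", 8.05),
    (3, "male",   "S", 7.25),
    (3, "male",   "S", missing),
    (1, "female", "C", 71.2833),
    (3, "female", "S", 7.925),
]

# Fares of the passengers in the same group as the one with the missing fare
group_fares = [r[4] for r in rows if r[1] == 3 && r[2] == "male" && r[3] == "S"]

# Mean of the group, ignoring the missing entry
fill_value = mean(skipmissing(group_fares))

# Fill the gap and leave every known fare untouched
fares = coalesce.(last.(rows), fill_value)
println(fares)
```

Only the missing entry changes; the fill value is the mean of the two known fares in the matching group.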

Age Feature


We know that the feature “Age” is the one with most missing values, let’s see it in terms of percentage.

::: {#60 .cell _cell_guid=‘8ff25fb3-7a4a-4e06-b48f-a06b8d844917’ _kg_hide-input=‘true’ _uuid=‘c356e8e85f53a27e44b5f28936773a289592c5eb’ execution=‘{“iopub.execute_input”:“2021-06-26T16:35:10.014347Z”,“iopub.status.busy”:“2021-06-26T16:35:10.014023Z”,“iopub.status.idle”:“2021-06-26T16:35:10.024214Z”,“shell.execute_reply”:“2021-06-26T16:35:10.023404Z”,“shell.execute_reply.started”:“2021-06-26T16:35:10.014284Z”}’ execution_count=1}

println("Train age missing value: $(round(count(ismissing, train.Age) / DF.nrow(train) * 100, digits=2))%")
println("Test age missing value: $(round(count(ismissing, test.Age) / DF.nrow(test) * 100, digits=2))%")
Train age missing value: 19.87%
Test age missing value: 20.57%

:::

We will take a different approach since ~20% of the data in the Age column is missing in both the train and test datasets. The Age variable seems promising for determining survival rate, so it would be unwise to replace the missing values with the median, mean, or mode. Instead, we will use a machine learning model, a Random Forest regressor, to impute the missing values. We will keep the Age column unchanged for now and work on that in the feature engineering section.
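The general shape of model-based imputation can be sketched ahead of that section: fit a model on the rows where Age is known, then predict Age for the rows where it is missing. To stay dependency-free, the sketch below swaps in an ordinary least-squares line for the Random Forest regressor, uses a single predictor, and runs on made-up numbers; none of this is the kernel's actual implementation.

```julia
using LinearAlgebra  # standard library; least squares via the backslash operator

# Toy data: one predictor (Pclass) and Age with gaps; values are illustrative
pclass = [1.0, 2.0, 3.0, 1.0, 2.0, 3.0]
age    = [40.0, 30.0, 20.0, 40.0, missing, missing]

known   = findall(!ismissing, age)
unknown = findall(ismissing, age)

# Fit age ≈ b0 + b1 * pclass on the rows where Age is known
# (a linear model stands in for the random forest here)
X = hcat(ones(length(known)), pclass[known])
y = Float64.(age[known])
beta = X \ y

# Predict the missing ages and write them back next to the known ones
age_filled = Vector{Float64}(undef, length(age))
age_filled[known] = y
age_filled[unknown] = hcat(ones(length(unknown)), pclass[unknown]) * beta

println(round.(age_filled, digits = 2))
```

The known ages are copied over unchanged; only the gaps receive predictions. A random forest would replace the two lines that fit and apply `beta`.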

Part 3. Visualization and Feature Relations


Before we dive into finding relations between independent variables and our dependent variable(survivor), let us create some assumptions about how the relations may turn-out among features.

Assumptions:

  • Gender: Females survived at a higher rate than males.
  • Pclass: Passengers with higher socio-economic status survived at a higher rate than others.
  • Age: Younger passengers survived at a higher rate than other passengers.
  • Fare: Passengers who paid a higher fare survived at a higher rate than other passengers. This is likely to be strongly correlated with Pclass.

Now, let’s see how the features are related to each other by creating some visualizations.

3a. Gender and Survived


Makie.set_theme!(Makie.theme_light())
fig = Makie.Figure()
ax = Makie.Axis(fig[1, 1], 
    title = "Survived/Non-Survived Passenger Gender Distribution",
    xlabel = "Sex",
    ylabel = "% of passenger survived",
    xticks= (1:2, ["Male", "Female"]),
    
)

# Calculate survival rates by gender
survival_by_sex = @chain train begin
    DF.groupby(:Sex)
    DF.combine(:Survived => Stats.mean => :survival_rate)
    DF.sort(:Sex, rev=true)  # descending order: male first, matching the xticks
end

# Create elegant barplot
Makie.barplot!(ax, 1:2, survival_by_sex.survival_rate, 
           color = ["green", "pink"])

fig

The bar plot above shows the survival rate of male and female passengers. The x-axis represents the Sex feature, while the y-axis represents the share of passengers who survived. It shows that ~74% of female passengers survived, while only ~19% of male passengers did.

fig = Makie.Figure()
ax = Makie.Axis(fig[1, 1],
    title = "Passenger Gender Distribution - Survived vs Not-survived",
    xlabel = "Sex",
    ylabel = "# of Passenger Survived",
    xticks = (1:2, ["Male", "Female"])
)

# Count data for grouped bar chart
count_data = @chain train begin
    DF.groupby([:Sex, :Survived])
    DF.combine(DF.nrow => :count)
    DF.unstack(:Survived, :count, fill=0)
end

# Create grouped bar chart
counts = [count_data[1, 2], count_data[1, 3], count_data[2, 2], count_data[2, 3]]


Makie.barplot!(ax, [1, 1, 2, 2], counts,
           dodge = [1, 2, 1,2],
           color = ["gray", "green", "gray", "green"])



# Add legend
Makie.Legend(fig[1, 2], 
    [Makie.PolyElement(color = "gray"), Makie.PolyElement(color = "green")],
    ["Not Survived", "Survived"],
    "Survival Status")

fig

This count plot shows the actual number of male and female passengers who survived and did not survive. Among the female passengers, ~230 survived and ~80 did not, while among the male passengers ~110 survived and ~470 did not.
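The percentages in the first plot follow directly from the counts in the second. A quick check in plain Julia, using the per-sex counts of Kaggle's train.csv (233 and 81 for females, 109 and 468 for males); the `counts` dictionary and the `rate` helper are hypothetical names for this illustration.

```julia
# Counts in Kaggle's train.csv per sex: (survived, did not survive)
counts = Dict("female" => (233, 81), "male" => (109, 468))

# Survival rate in percent for one sex
rate(sex) = round(100 * counts[sex][1] / sum(counts[sex]), digits = 1)

println("female: ", rate("female"), "%  male: ", rate("male"), "%")
```

The four counts add up to the 891 rows of the training set.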

Summary


  • As we suspected, female passengers survived at a much higher rate than male passengers.
  • This seems about right, since women and children were given priority.

3b. Pclass and Survived


fig = Makie.Figure()
ax = Makie.Axis(fig[1, 1],
    title = "Passenger Class Distribution - Survival Percentage",
    xlabel = "Passenger Class",
    ylabel = "Percentage",
    titlesize = 20,
    xlabelsize = 16,
    ylabelsize = 16,
    xticks=(1:3, ["1st Class", "2nd Class", "3rd Class"])
)

# Calculate percentages by class
class_survival = @chain train begin
    DF.groupby([:Pclass, :Survived])
    DF.combine(DF.nrow => :count)
    DF.sort([:Pclass, :Survived])  # fix the row order (1st, 2nd, 3rd) and column order (0, 1)
    DF.unstack(:Survived, :count, fill=0)
end

no_survived = class_survival[:, 2]
yes_survived = class_survival[:, 3]
total_by_class = no_survived + yes_survived

survived_percentage = (yes_survived ./ total_by_class) * 100
not_survived_percentage = (no_survived ./ total_by_class) * 100

flatten = vcat(not_survived_percentage, survived_percentage)

# x positions are the classes; `stack` identifies the layer (1 = not survived, 2 = survived)
Makie.barplot!(ax, [1, 2, 3, 1, 2, 3], flatten, stack = [1, 1, 1, 2, 2, 2], color = ["red", "red", "red", "green", "green", "green"], strokewidth = 1, strokecolor = :black)

# Add legend (colors match the bars)
Makie.Legend(fig[1, 2],
    [Makie.PolyElement(color = "red"), Makie.PolyElement(color = "green")],
    ["Not Survived", "Survived"],
    "Survival Status")

fig
Makie.barplot([1, 2, 3], survived_percentage, axis=(xticks=(1:3, ["1st Class", "2nd Class", "3rd Class"]), title = "Passenger Class Distribution - Survived vs Non-Survived"), color=["brown", "orange", "green"])
  • It looks like …
    • ~63% of first-class passengers survived the Titanic tragedy, while
    • ~47% of second-class and
    • only ~24% of third-class passengers survived.

fig = Makie.Figure()
# Title and axis labels belong to the Axis, not the Figure
ax = Makie.Axis(fig[1, 1],
    title = "Passenger Class Distribution - Survived vs Non-Survived",
    xlabel = "Passenger Class",
    ylabel = "Density of Passenger Survived",
    xticks = ([1, 2, 3], ["Upper", "Middle", "Lower"]))

d1 = Makie.density!(ax, train.Pclass[train.Survived .== 0], color = (:gray, 0.2), strokecolor=:gray, strokewidth=2)

d2= Makie.density!(ax, train.Pclass[train.Survived .== 1], color = (:green, 0.2), strokecolor=:green, strokewidth=2)

Makie.axislegend(ax,
    [d1, d2],
    ["Not Survived", "Survived"],
    "Survival Status")

fig

This KDE plot is fairly self-explanatory with its labels and colors. One thing some readers might find questionable is that more lower-class passengers survived than second-class passengers. That is true in absolute numbers, simply because there were far more third-class passengers than first- or second-class ones; their survival rate was still the lowest.

Summary


First-class passengers had the upper hand during the tragedy. The next section of visualizations, which looks at the distribution of ticket fare against the Survived column, supports this further.

3c. Fare and Survived


fig = Makie.Figure()

ax = Makie.Axis(fig[1, 1],
    title = "Fare Distribution - Survived vs Non-Survived",
    xlabel = "Fare",
    ylabel = "Density of Passenger Survived",
)
 
d1 = Makie.density!(ax, train.Fare[train.Survived .== 0], color = (:gray, 0.2), strokecolor=:gray, strokewidth=2)
d2 = Makie.density!(ax, train.Fare[train.Survived .== 1], color = (:green, 0.2), strokecolor=:green, strokewidth=2)

Makie.axislegend(ax,
    [d1, d2],
    ["Not Survived", "Survived"],
    "Survival Status")
fig

This plot shows something interesting:

  • The spike below 100 dollars shows that many passengers who bought tickets in that range did not survive.
  • For fares above roughly 280 dollars there is no gray shade, which means either everyone who paid more than that survived or an outlier is clouding our judgment. Let’s check…
train[train.Fare .> 280, :]
3×12 DataFrame
Row  PassengerId  Pclass  Name                                Sex      Age       SibSp  Parch  Ticket    Fare      Cabin   Embarked   Survived
     Int64        Int64   String                              String7  Float64?  Int64  Int64  String31  Float64?  String  Abstract…  Int64
1    259          1       Ward, Miss. Anna                    female   35.0      0      0      PC 17755  512.329   B       C          1
2    680          1       Cardeza, Mr. Thomas Drake Martinez  male     36.0      0      1      PC 17755  512.329   B       C          1
3    738          1       Lesurer, Mr. Gustave J              male     35.0      0      0      PC 17755  512.329   B       C          1

As we suspected, this looks like an outlier: all three passengers share the same ticket (PC 17755) and a fare of about $512. We could delete these points, but we will keep them for now.

3d. Age and Survived



fig = Makie.Figure()

ax = Makie.Axis(fig[1, 1], title = "Age Distribution - Survived vs Non-Survived",
    xlabel = "Age",
    ylabel = "Density of Passenger Survived")


# clean missing first
clean_train =  DF.dropmissing(train, :Age)
not_survived = clean_train.Age[clean_train.Survived .== 0]
survived = clean_train.Age[clean_train.Survived .== 1]

d1 = Makie.density!(ax, not_survived, color = (:gray, 0.2), strokecolor=:gray, strokewidth=2)
d2 = Makie.density!(ax, survived, color = (:green, 0.2), strokecolor=:green, strokewidth=2)

Makie.axislegend(ax,
    [d1, d2],
    ["Not Survived", "Survived"],
    "Survival Status")

fig

There is nothing out of the ordinary in this plot except the far left of the distribution, which hints at the possibility that children and infants were given priority.

3e. Combined Feature Relations


In this section, we are going to discover more than two feature relations in a single graph. I will try my best to illustrate most of the feature relations. Let’s get to it.


fig = Makie.Figure(title="Survived by Sex and Age")

# Create subplots for each combination

for (i, (sex, survived)) in enumerate(Iterators.product(["female", "male"], [0, 1]))

    ax = Makie.Axis(fig[div(i - 1, 2) + 1, i % 2 + 1],
        title = "$sex $(survived == 1 ? "Survived" : "Not Survived")",
        xlabel = "Age",
        ylabel = "Count"
    )
    
    subset_data = train[(train.Sex .== sex) .& (train.Survived .== survived) .& .!ismissing.(train.Age), :]
    
    Makie.hist!(ax, subset_data.Age, bins = 20, 
            color = survived == 1 ? "green" : "gray",
            strokewidth = 1, strokecolor = :white)
   
end

fig

A facet grid is a great way to visualize multiple variables and their relationships at once. From the chart in section 3a we have an intuition that female passengers had higher priority than males during the tragedy. From this facet grid we can also see which age groups survived more often than others and which were not so lucky.

fig = Makie.Figure(title="Survived by Sex, Age and Embarked")

# Create subplots for each combination
for (i, (sex, embarked)) in enumerate(Iterators.product(["female", "male"], ["S", "C", "Q"]))

    ax = Makie.Axis(fig[div(i - 1, 2) + 1, i % 2 + 1],
        title = "$sex $embarked",
    )

    subset_data = train[(train.Sex .== sex) .& (train.Embarked .== embarked) .& .!ismissing.(train.Age), :]

    for (survived) in [0, 1]
        subset_survived = subset_data[(subset_data.Survived .== survived), :]
        println("Length of subset: $(DF.nrow(subset_survived))")

        # Only plot survival groups that actually contain passengers
        if DF.nrow(subset_survived) > 0
             Makie.hist!(ax, subset_survived.Age, 
                        bins = 20,
                        color = survived == 1 ? (:green, 0.5) : (:gray, 0.5),
                        strokewidth = 1, 
                        strokecolor = :white,
                        label = survived == 1 ? "Survived" : "Not Survived"
                    )
        end
    end
end


Makie.Legend(fig[1, 3], 
    [Makie.PolyElement(color = (:gray, 0.7)), 
     Makie.PolyElement(color = (:green, 0.7))],
    ["Not Survived", "Survived"],
    "Survival Status"
)

fig
Length of subset: 53
Length of subset: 133
Length of subset: 300
Length of subset: 68
Length of subset: 6
Length of subset: 57
Length of subset: 45
Length of subset: 24
Length of subset: 5
Length of subset: 7
Length of subset: 15
Length of subset: 1

This is another compelling facet grid, illustrating the relationship between four features at once: Embarked, Age, Survived and Sex.

  • The color shows survival status (green for survived, gray for not survived).
  • The columns represent Sex (male on the left, female on the right).
  • The rows represent Embarked (from top to bottom: S, C, Q).

Now that the obvious is out of the way, let’s see whether we can get some less obvious insights from the data.

  • Most passengers boarded at Southampton (S).
  • More than 60% of the passengers who boarded at Southampton died.
  • More than 60% of the passengers who boarded at Cherbourg (C) survived.
  • Pretty much every male who boarded at Queenstown (Q) did not survive.
  • Very few females boarded at Queenstown; however, most of them survived.
fig = Makie.Figure(size = (1000, 600))

ax_m = Makie.Axis(fig[1, 1],
    title = "Male", 
    xlabel = "Fare",
    ylabel = "Age")
# Female subplot
ax_f = Makie.Axis(fig[1, 2], 
    title = "Female",
    xlabel = "Fare",
    ylabel = "Age")

female_data = train[(train.Sex .== "female") .& .!ismissing.(train.Age), :]
male_data = train[(train.Sex .== "male") .& .!ismissing.(train.Age), :]


Makie.scatter!(ax_m, male_data.Fare, male_data.Age,
           color = [s == 1 ? "green" : "gray" for s in male_data.Survived],
           strokewidth=1, strokecolor="white", markersize=14)
Makie.scatter!(ax_f, female_data.Fare, female_data.Age,
           color = [s == 1 ? "green" : "gray" for s in female_data.Survived],
           strokewidth=1, strokecolor="white", markersize=14)


# Add legend
Makie.Legend(fig[1, 3],
    [Makie.MarkerElement(color = "gray", marker = :circle), 
     Makie.MarkerElement(color = "green", marker = :circle)],
    ["Not Survived", "Survived"],
    "Survived")

Makie.Label(fig[0, :], "Survived by Sex, Fare and Age")
fig

This pair of scatter plots unveils a couple of interesting insights.

  • The plots above clearly show the three outliers with a Fare of over $500. At this point, I think we can be quite confident that these outliers should be deleted.
  • Most passengers paid a Fare of less than $100.
train = train[train.Fare .< 500, :]

fig = Makie.Figure(size = (800, 600))
ax = Makie.Axis(fig[1, 1],
    title = "Parents/Children Survival Rate",
    xlabel = "Number of Parents/Children",
    ylabel = "Survival Rate",
)

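# train_clean is only created in Part 4; Parch and Survived have no missing values, so train is used here
parch_survival = @chain train begin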
    DF.groupby(:Parch)
    DF.combine(
        :Survived => Stats.mean => :survival_rate,
        :Survived => Stats.std => :std_dev,
        :Survived => length => :count
    )
end

parch_survival.std_error = parch_survival.std_dev ./ sqrt.(parch_survival.count)

Makie.scatterlines!(ax, parch_survival.Parch, parch_survival.survival_rate,
    color = "#2196F3", 
    linewidth = 3,
    markersize = 8
)

# Not assigned to `error`, which would shadow Base.error
Makie.errorbars!(ax, parch_survival.Parch, parch_survival.survival_rate, 
    parch_survival.std_error,
    color = "blue",
    linewidth = 2,
    whiskerwidth = 8
)

Makie.Legend(fig[1, 2],
    [Makie.PolyElement(color = "#2196F3"), Makie.PolyElement(color = "blue")],
    ["Survival Rate", "Standard Error"],
    "Legend"
)
fig

Passengers who traveled in big groups with parents/children had a lower survival rate than other passengers.

fig = Makie.Figure(size = (800, 600))
ax = Makie.Axis(fig[1, 1],
    title = "Siblings/Spouses Survival Rate",
    xlabel = "Number of Siblings/Spouses",
    ylabel = "Survival Rate",
)

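# train_clean is only created in Part 4; SibSp and Survived have no missing values, so train is used here
sibsp_survival = @chain train begin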
    DF.groupby(:SibSp)
    DF.combine(
        :Survived => Stats.mean => :survival_rate,
        :Survived => Stats.std => :std_dev,
        :Survived => length => :count
    )
end

sibsp_survival.std_error = sibsp_survival.std_dev ./ sqrt.(sibsp_survival.count)

Makie.scatterlines!(ax, sibsp_survival.SibSp, sibsp_survival.survival_rate,
    color = "#2196F3", 
    linewidth = 3,
    markersize = 8
)

# Not assigned to `error`, which would shadow Base.error
Makie.errorbars!(ax, sibsp_survival.SibSp, sibsp_survival.survival_rate, 
    sibsp_survival.std_error,
    color = "blue",
    linewidth = 2,
    whiskerwidth = 8
)

Makie.Legend(fig[1, 2],
    [Makie.PolyElement(color = "#2196F3"), Makie.PolyElement(color = "blue")],
    ["Survival Rate", "Standard Error"],
    "Legend"
)
fig

Passengers who traveled in small groups with siblings/spouses, on the other hand, had a better chance of surviving than other passengers.

train.Sex = [sex == "female" ? 0 : 1 for sex in train.Sex]
test.Sex = [sex == "female" ? 0 : 1 for sex in test.Sex];

Part 4: Statistical Overview



Train info

DF.describe(train)
12×7 DataFrame
Row  variable     mean      min                  median   max                          nmissing  eltype
     Symbol       Union…    Any                  Union…   Any                          Int64     Type
1    PassengerId  445.618   1                    445.5    891                          0         Int64
2    Pclass       2.31306   1                    3.0      3                            0         Int64
3    Name                   Abbing, Mr. Anthony           van Melkebeke, Mr. Philemon  0         String
4    Sex          0.647523  0                    1.0      1                            0         Int64
5    Age          29.6753   0.42                 28.0     80.0                         177       Union{Missing, Float64}
6    SibSp        0.524775  0                    0.0      8                            0         Int64
7    Parch        0.381757  0                    0.0      6                            0         Int64
8    Ticket                 110152                        WE/P 5735                    0         String31
9    Fare         30.5822   0.0                  14.4542  263.0                        0         Union{Missing, Float64}
10   Cabin                  A                             T                            0         String
11   Embarked               C                             S                            0         AbstractString
12   Survived     0.381757  0                    0.0      1                            0         Int64
categorical_cols = [col for col in names(train) if eltype(train[!, col]) <: Union{String, AbstractString}]
DF.describe(train[!, categorical_cols])
4×7 DataFrame
Row  variable  mean     min                  median   max                          nmissing  eltype
     Symbol    Nothing  Abstract…            Nothing  Abstract…                    Int64     DataType
1    Name               Abbing, Mr. Anthony           van Melkebeke, Mr. Philemon  0         String
2    Ticket             110152                        WE/P 5735                    0         String31
3    Cabin              A                             T                            0         String
4    Embarked           C                             S                            0         AbstractString
survived_summary = @chain train begin
    DF.select(DF.names(train, Number)...)
    DF.groupby(:Survived)
    DF.combine(DF.All() .=> Stats.mean)
end
2×7 DataFrame
Row Survived PassengerId_mean Pclass_mean Sex_mean SibSp_mean Parch_mean Survived_mean
Int64 Float64 Float64 Float64 Float64 Float64 Float64
1 0 447.016 2.53188 0.852459 0.553734 0.32969 0.0
2 1 443.354 1.9587 0.315634 0.477876 0.466077 1.0
sex_summary = @chain train begin
    DF.select(DF.names(train, Number)...)
    DF.groupby(:Sex)
    DF.combine(DF.All() .=> Stats.mean)
end
2×7 DataFrame
Row Sex PassengerId_mean Pclass_mean Sex_mean SibSp_mean Parch_mean Survived_mean
Int64 Float64 Float64 Float64 Float64 Float64 Float64
1 0 431.578 2.16294 0.0 0.696486 0.651757 0.741214
2 1 453.261 2.39478 1.0 0.431304 0.234783 0.186087
class_summary = @chain train begin
    DF.select(DF.names(train, Number)...)
    DF.groupby(:Pclass)
    DF.combine(DF.All() .=> Stats.mean)
end
3×7 DataFrame
Row Pclass PassengerId_mean Pclass_mean Sex_mean SibSp_mean Parch_mean Survived_mean
Int64 Float64 Float64 Float64 Float64 Float64 Float64
1 1 460.225 1.0 0.56338 0.422535 0.356808 0.624413
2 2 445.957 2.0 0.586957 0.402174 0.380435 0.472826
3 3 439.155 3.0 0.706721 0.615071 0.393075 0.242363

I have gathered a small summary from the statistical overview above. Let’s see what it says…

  • The original train data set has 891 rows and 12 columns; 888 rows remain after removing the three fare outliers.
  • Only ~38% of passengers survived the tragedy.
  • ~74% of female passengers survived, while only ~19% of male passengers did.
  • ~62% of first-class passengers survived, while only ~24% of third-class passengers did.

4a. Correlation Matrix and Heatmap


Correlations

train_clean = DF.dropmissing(train)
train_numeric = DF.select(train_clean, DF.names(train_clean, Number)...)
corr_matrix = Stats.cor(Stats.Matrix(train_numeric))

corr_df = DF.DataFrame(corr_matrix, DF.names(train_numeric))
8×8 DataFrame
Row PassengerId Pclass Sex Age SibSp Parch Fare Survived
Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64
1 1.0 -0.0329038 0.0211665 0.0361809 -0.0816053 -0.0122084 -0.00823828 0.0272982
2 -0.0329038 1.0 0.156916 -0.368625 0.0641473 0.0250993 -0.617708 -0.354847
3 0.0211665 0.156916 1.0 0.0932959 -0.104071 -0.248746 -0.232061 -0.541935
4 0.0361809 -0.368625 0.0932959 1.0 -0.307639 -0.189194 0.100396 -0.0794722
5 -0.0816053 0.0641473 -0.104071 -0.307639 1.0 0.384056 0.196907 -0.014598
6 -0.0122084 0.0250993 -0.248746 -0.189194 0.384056 1.0 0.258878 0.0942578
7 -0.00823828 -0.617708 -0.232061 0.100396 0.196907 0.258878 1.0 0.275122
8 0.0272982 -0.354847 -0.541935 -0.0794722 -0.014598 0.0942578 0.275122 1.0
DF.sort(DF.DataFrame(
    Variable = DF.names(corr_df),
    Correlation = abs.(corr_df[!, :Survived])
), [:Correlation], rev=true)
8×2 DataFrame
Row Variable Correlation
String Float64
1 Survived 1.0
2 Sex 0.541935
3 Pclass 0.354847
4 Fare 0.275122
5 Parch 0.0942578
6 Age 0.0794722
7 PassengerId 0.0272982
8 SibSp 0.014598

Sex is the feature most strongly correlated with Survived (the dependent variable), followed by Pclass.

DF.sort(DF.DataFrame(
    Variable = DF.names(corr_df),
    Correlation = abs.(corr_df[!, :Survived]) .^ 2
), [:Correlation], rev=true)
8×2 DataFrame
Row Variable Correlation
String Float64
1 Survived 1.0
2 Sex 0.293693
3 Pclass 0.125916
4 Fare 0.0756923
5 Parch 0.00888453
6 Age 0.00631583
7 PassengerId 0.000745194
8 SibSp 0.000213103

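Squaring the correlation coefficients makes every value non-negative and gives \(r^2\), the share of the variance in Survived that a linear relationship with each feature accounts for. It also widens the gap between strong and weak correlations.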
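To make the effect of squaring concrete, here is a minimal sketch on a tiny made-up data set. The two vectors are hypothetical values chosen for illustration, not rows from the Titanic data, and only the Statistics standard library is used.

```julia
using Statistics

# Hypothetical toy data (not taken from the Titanic set)
sex      = [1, 1, 1, 1, 0, 0, 0, 0]   # 1 = male, 0 = female
survived = [0, 0, 0, 1, 1, 1, 1, 0]

r  = cor(sex, survived)   # Pearson correlation, keeps the sign
r2 = r^2                  # squared correlation, always between 0 and 1

println("r   = ", round(r, digits = 2))
println("r^2 = ", round(r2, digits = 2))
```

The sign of r tells us the direction of the relationship, while r^2 only tells us how strong it is, which is why the squared ranking above keeps the same order as the absolute-value ranking.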

n = size(corr_df, 1)

fig = Makie.Figure()
ax = Makie.Axis(fig[1, 1], 
    title = "Correlations Between Variables",  
    xticks=((1:n), DF.names(corr_df)), 
    yticks=((1:n), DF.names(corr_df)),
)

hm = Makie.heatmap!(ax, (1:n), (1:n), corr_matrix, colormap="RdBu")

for i in 1:n
    for j in 1:n
        text_val = corr_matrix[j, i]
        
        Makie.text!(ax, i, j, 
            text = string(round(text_val, digits=2)),
            color = abs(text_val) > 0.5 ? :white : :black,
            fontsize = 10,
            align = (:center, :center)
        )
    end
end

Makie.Colorbar(fig[1, 2], hm, 
    label = "Correlation Coefficient",
)

fig

Positive Correlation Features:

  • Fare and Survived: 0.28

Negative Correlation Features:

  • Fare and Pclass: -0.62
  • Sex and Survived: -0.54
  • Pclass and Survived: -0.35

So, let’s analyze these correlations a bit. We have found some moderately strong relationships between different features. There is a clear positive correlation between Fare and Survived: passengers who paid more for their ticket were more likely to survive. This aligns with another correlation, the one between Fare and Pclass (-0.62), which says that first-class passengers (1) paid more than second-class passengers (2), who in turn paid more than third-class passengers (3). It is also supported by the correlation between Pclass and our dependent variable, Survived, which is -0.35: first-class passengers had a better chance of surviving than second-class passengers, who in turn had a better chance than third-class passengers.

However, the strongest correlation with our dependent variable belongs to the Sex variable, which records whether the passenger was male (1) or female (0). This negative correlation of -0.54 points towards some undeniable insights. Let’s do some statistics to see how statistically significant this correlation is.

4b. Statistical Test for Correlation


Statistical tests are the scientific way to check whether the data support a theory. When we look at the data, we often have an intuitive sense of where it is leading us. A statistical test adds a mathematical perspective on how significant those results are. Let’s apply some of these methods and see how our intuitions hold up.

Hypothesis Testing Outline

A hypothesis test compares the mean of a control group with that of an experimental group: it asks whether the two sample means differ and, if they do, how significant that difference is.

A hypothesis test usually consists of multiple parts:

  1. Formulate a well-developed research problem or question: The hypothesis test usually starts with a concrete and well-developed researched problem. We need to ask the right question that can be answered using statistical analysis.

  2. The null hypothesis (\(H_0\)) and the alternative hypothesis (\(H_A\), also written \(H_1\)):
     • The null hypothesis (\(H_0\)) is something that is assumed to be true. It is the status quo. Under the null hypothesis, the observations are the result of pure chance. When we set out to experiment, we form the null hypothesis by saying that there is no difference between the means of the control group and the experimental group.
     • The alternative hypothesis (\(H_A\)) is a claim and the opposite of the null hypothesis. It goes against the status quo. Under the alternative hypothesis, the observations show a real effect combined with a component of chance variation.

  3. Determine the test statistic: The test statistic is used to assess the evidence against the null hypothesis. We use the z-statistic when the population standard deviation is known and the t-statistic when it has to be estimated from the sample. In addition to that, we want to identify whether the test is a one-tailed test or two-tailed test. This article explains it pretty well. This article is pretty good as well.

  4. Specify a significance level and confidence level: The significance level (\(\alpha\)) is the probability of rejecting the null hypothesis when it is true. In other words, it is the share of false rejections we are willing to tolerate. The significance level is one minus the confidence level. For example, if our significance level is 5%, then our confidence level is (1 - 0.05) = 0.95 or 95%.

  5. Compute the t-statistic/z-statistic: Computing the t-statistic follows a simple equation, which differs slightly between a one-sample test and a two-sample test.

  6. Compute the p-value: The p-value is the probability of obtaining a test statistic at least as extreme as the one observed, assuming that the null hypothesis is correct. The p-value is known to be unintuitive, and even many professors are known to explain it wrong. I think this video explains the p-value well. The smaller the p-value, the stronger the evidence against the null hypothesis.

  7. Describe the result and compare the p-value with the significance level (\(\alpha\)): If p <= \(\alpha\), the observed effect is statistically significant and we reject the null hypothesis in favor of the alternative hypothesis. If p > \(\alpha\), we say that we fail to reject the null hypothesis. The phrase sounds awkward, but it is logically right: we never accept the null hypothesis, because a test on sample data can only fail to find evidence against it.

We will follow each of the steps above to do our hypothesis testing below.

P.S. Khan Academy has a set of videos that I think are intuitive and helped me understand the concepts.
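Since the p-value from step 6 is the least intuitive part, a small simulation may help. The sketch below estimates a one-tailed p-value with a permutation test, which is a different technique from the t-test used later but rests on the same definition: how often chance alone produces a difference at least as extreme as the observed one. The two groups are made-up 0/1 values and the function name shuffled_difference is hypothetical.

```julia
using Random, Statistics

# Hypothetical 0/1 outcomes for two groups (made-up values for illustration)
group_a = vcat(ones(9),  zeros(41))   # observed mean 0.18
group_b = vcat(ones(37), zeros(13))   # observed mean 0.74

observed = mean(group_b) - mean(group_a)
pooled   = vcat(group_a, group_b)
n_a      = length(group_a)

# Difference of means after randomly reassigning the group labels,
# which is what the null hypothesis "no difference" implies
function shuffled_difference(rng, pooled, n_a)
    shuffled = shuffle(rng, pooled)
    return mean(shuffled[n_a+1:end]) - mean(shuffled[1:n_a])
end

rng = MersenneTwister(1)
n_perm = 10_000
n_extreme = count(shuffled_difference(rng, pooled, n_a) >= observed for _ in 1:n_perm)

# One-tailed p-value; the +1 keeps the estimate away from exactly zero
p_value = (n_extreme + 1) / (n_perm + 1)
println("Estimated p-value: ", p_value)
```

If shuffling the labels almost never reproduces a gap as large as the observed one, the p-value is tiny and the null hypothesis looks implausible.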


Hypothesis testing for Titanic

Formulating a well developed researched question:

Regarding this dataset, we can formulate the null hypothesis and alternative hypothesis by asking the following questions.

  • Is there a significant difference in the mean sex between the passengers who survived and those who did not?
  • Is there a substantial difference in the survival rate between male and female passengers?

The Null Hypothesis and The Alternative Hypothesis:

We can formulate our hypothesis by asking questions differently. However, it is essential to understand what our end goal is. Here our dependent variable or target variable is Survived. Therefore, we say

Null Hypothesis (\(H_0\)): There is no difference in the survival rate between male and female passengers; that is, the mean difference in survival rate between male and female passengers is zero.
Alternative Hypothesis (\(H_A\)): There is a difference in the survival rate between male and female passengers; that is, the mean difference in survival rate between male and female passengers is not zero.

One thing we can do is set up the null and alternative hypotheses in such a way that, when we do our t-test, we can choose a one-tailed test. According to this article, one-tailed tests are more powerful than two-tailed tests. In addition to that, this video is also quite helpful for understanding these topics. With this in mind, we can modify our null and alternative hypotheses. Let’s see how we can rewrite them.

Null Hypothesis(H0): male mean is greater or equal to female mean.

Alternative Hypothesis(H1): male mean is less than female mean.

Determine the test statistics:

With the hypotheses as rewritten above, this is a one-tailed (left-tailed) test: we only ask whether the male survival rate is lower than the female survival rate. Since we do not know the population standard deviation (\(\sigma\)) and have to estimate it from the samples, we will use the t-distribution.

Specify the significance level:

Specifying a significance level is an important step of the hypothesis test. It sets the balance between type 1 error and type 2 error. We will discuss those in more depth in another lesson. For now, we set our significance level (\(\alpha\)) to 0.05, so our confidence level is (1 - \(\alpha\)) = (1 - 0.05) = 95%.
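To see what a significance level of 0.05 means in practice, the sketch below simulates many experiments in which the null hypothesis is true by construction and counts how often it is rejected anyway (a type 1 error). It assumes a simpler setting than ours for illustration: a two-tailed z-test on standard normal data with known standard deviation 1, for which the critical value is 1.96.

```julia
using Random, Statistics

# Assumption for illustration: standard normal data (known sigma = 1),
# so the null hypothesis "mean = 0" is true in every simulated experiment
rng = MersenneTwister(2025)
n_trials = 20_000      # number of simulated experiments
n = 30                 # observations per experiment
z_crit = 1.96          # two-tailed critical value for alpha = 0.05

# z = (sample mean - 0) / (1 / sqrt(n)); reject when |z| exceeds the critical value
false_rejections = count(abs(mean(randn(rng, n)) * sqrt(n)) > z_crit for _ in 1:n_trials)

rate = false_rejections / n_trials
println("Share of false rejections: ", round(rate, digits = 3))
```

The share of false rejections should land close to 0.05, which is exactly the risk we accept when we choose that significance level.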

Computing T-statistics and P-value:

Let’s first look at the survival means of both groups in the training set.

male_mean = DF.mean(train[train.Sex .== 1, :Survived])
female_mean = DF.mean(train[train.Sex .== 0, :Survived])

println("Male survival mean: ", male_mean)
println("Female survival mean: ", female_mean)
println("The mean difference between male and female survival rate: ", female_mean - male_mean)
Male survival mean: 0.18608695652173912
Female survival mean: 0.7412140575079872
The mean difference between male and female survival rate: 0.5551271009862481

Now, we have to understand that those two means are not the population means (\(\mu\)). The population mean is the term statisticians use for the actual average of an entire group. The group can be any collection of things we can measure, such as animals, humans, plants, money or stocks. For example, to find the population mean age of Bulgaria, we would have to record every single person’s age and take the average. That is almost impossible, and if we could go that route there would be no point in doing statistics in the first place. Therefore we approach the problem using samples. The idea is that if we take many samples from the same population, compute the mean of each one and plot those means as a distribution, the distribution starts to look like a normal distribution centered on the population mean. The larger the samples, the tighter the distribution of sample means sits around the population mean. This is the central limit theorem. We will go into more depth on this topic later on.
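The central limit theorem is easy to check with a quick simulation. The sketch below builds a 0/1 population chosen for illustration (342 ones out of 891 values), draws many random samples from it, and compares the distribution of the sample means with what the theorem predicts. The sample counts and the seed are arbitrary choices.

```julia
using Random, Statistics

# A 0/1 population chosen for illustration: 342 ones out of 891 values
population = vcat(ones(342), zeros(549))

rng = MersenneTwister(7)
n_samples, sample_size = 2_000, 50

# Mean of each random sample (drawn with replacement)
sample_means = [mean(rand(rng, population, sample_size)) for _ in 1:n_samples]

pop_mean = mean(population)
pop_std  = std(population, corrected = false)

println("Population mean:          ", round(pop_mean, digits = 3))
println("Mean of the sample means: ", round(mean(sample_means), digits = 3))
println("Std of the sample means:  ", round(std(sample_means), digits = 3))
println("sigma / sqrt(n):          ", round(pop_std / sqrt(sample_size), digits = 3))
```

The mean of the sample means should sit very close to the population mean, and their spread should be close to the population standard deviation divided by the square root of the sample size.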

Going back to our dataset: as we said, the means above are only part of the story. We were given part of the data to train our machine learning models, and the other part was held back for testing. Therefore, it is impossible for us at this point to know the population means of survival for males and females. A situation like this calls for a statistical approach. We will use the sampling distribution approach to do the test. Let’s draw 50 random samples of 50 males and 50 females each from our train data.

male = train[train.Sex .== 1, :]
female = train[train.Sex .== 0, :]

# Empty vectors to store the sample means
m_mean_samples = Float64[]
f_mean_samples = Float64[]

# Generate 50 random samples
for i in 1:50
    # Randomly draw 50 elements without replacement
    male_sample = StatsBase.sample(male.Survived, 50, replace=false)
    female_sample = StatsBase.sample(female.Survived, 50, replace=false)
    
    push!(m_mean_samples, DF.mean(male_sample))
    push!(f_mean_samples, DF.mean(female_sample))
end

println("Male mean sample mean: ", round(DF.mean(m_mean_samples), digits=2))
println("Female mean sample mean: ", round(DF.mean(f_mean_samples), digits=2))
println("Difference between male and female mean sample mean: ", 
        round(DF.mean(f_mean_samples) - DF.mean(m_mean_samples), digits=2))
Male mean sample mean: 0.18
Female mean sample mean: 0.75
Difference between male and female mean sample mean: 0.56

H0: male mean is greater or equal to female mean
H1: male mean is less than female mean.

According to the samples, the measured difference between our male sample mean (\(\bar{x}_m\)) and female sample mean (\(\bar{x}_f\)) is ~0.56 (statistically, this is called the point estimate of the difference between the male and female population means). Keep in mind that…

  • We randomly select 50 people to be in the male group and 50 people to be in the female group.
  • We know our sample is selected from a broader population (the training set).
  • We know we could have ended up with a totally different random sample of males and females.

With all three points above in mind, how confident are we that the measured difference is real or statistically significant? We can perform a t-test to evaluate that. When we perform a t-test, we are usually looking for evidence of a significant difference between a population mean and a hypothesized mean (one-sample t-test) or, as in our case, between two population means (two-sample t-test).

The t-statistic measures the degree to which our groups differ, standardized by the variance of our measurements. In other words, it is basically a measure of signal over noise. Let us describe the previous sentence a bit more for clarification. I am going to use this post as reference to describe the t-statistic here.

Calculating the t-statistics

\[t = \frac{\bar{x}-\mu}{\frac{S} {\sqrt{n}} }\]

Here..

  • \(\bar{x}\) is the sample mean.
  • \(\mu\) is the hypothesized mean.
  • S is the standard deviation.
  • n is the sample size.
  1. The numerator of this fraction, \((\bar{x}-\mu)\), is basically the strength of the signal: the difference between the sample mean and the hypothesized mean. The larger this difference, the stronger the signal.

  2. The denominator of this fraction, \({S}/{\sqrt{n}}\), measures the amount of variation or noise in the data set. Here S is the standard deviation, which tells us how much variation there is in the data, and n is the sample size.

So, according to the explanation above, the t-value or t-statistic measures the strength of the signal (the difference) relative to the amount of noise (the variation) in the data, and that is how we calculate the t-value in a one-sample t-test. To compare two sample means, as in our case, we use the following equation.

\[t = \frac{\bar{x}_M - \bar{x}_F}{\sqrt {s^2 (\frac{1}{n_M} + \frac{1}{n_F})}}\]

This equation may look more complex, but the idea behind the two is the same: both express signal over noise. The only difference is that the hypothesized mean is replaced with a second sample mean, and the single sample size is replaced with the two sample sizes.

Here..

  • \(\bar{x}_M\) is the mean of our male group sample measurements.
  • \(\bar{x}_F\) is the mean of our female group sample measurements.
  • \(n_M\) and \(n_F\) are the number of observations in each group.
  • \(s^2\) is the pooled sample variance of the two groups.

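It is good to have an understanding of what is going on in the background. In practice, however, we let a statistics library compute the t-statistic and p-value (scipy.stats in Python; in Julia, the HypothesisTests package installed in Part 1 offers two-sample t-tests).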
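For readers who want to see the two-sample formula at work, here is a minimal hand computation using only the Statistics standard library. The two samples are made-up 0/1 vectors with means close to the ones reported above; they are an assumption for illustration and not the notebook's actual random draws. The pooled variance is computed as the weighted average of the two sample variances.

```julia
using Statistics

# Hypothetical 0/1 survival samples (1 = survived), 50 passengers per group;
# these are made-up values, not draws from the actual train set
male_sample   = vcat(ones(9),  zeros(41))
female_sample = vcat(ones(37), zeros(13))

n_m, n_f = length(male_sample), length(female_sample)

# Pooled sample variance s^2
s2 = ((n_m - 1) * var(male_sample) + (n_f - 1) * var(female_sample)) / (n_m + n_f - 2)

# Signal (difference of means) over noise (standard error of the difference)
t  = (mean(male_sample) - mean(female_sample)) / sqrt(s2 * (1 / n_m + 1 / n_f))
df = n_m + n_f - 2

println("t = ", round(t, digits = 2), " with ", df, " degrees of freedom")
```

A strongly negative t-value means the male sample mean lies many standard errors below the female sample mean, which is the direction our one-tailed alternative hypothesis predicts.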

Compare P-value with \(\alpha\)

It looks like the p-value is very small compared to our significance level (\(\alpha\)) of 0.05, so the observed difference is statistically significant. Therefore, we reject the null hypothesis in favor of the alternative hypothesis: the survival rate of male passengers is lower than that of female passengers.

Part 5: Feature Engineering


Feature engineering is exactly what it sounds like. Sometimes we want to create extra features from within the features that we have; sometimes we want to remove features that are alike. Feature engineering is the umbrella term for all of that. It is important to create new features in ways that do not cause multicollinearity (a strong relationship among independent variables).

name_length

Creating a new feature “name_length” that holds the number of characters in each name.

train[!, :name_length] = [length(i) for i in train.Name]
test[!, :name_length] = [length(i) for i in test.Name]

function name_length_group(size)
    a = ""
    if size <= 20
        a = "short"
    elseif size <= 35
        a = "medium"
    elseif size <= 45
        a = "good"
    else
        a = "long"
    end
    return a
end

train[!, :nLength_group] = [name_length_group(x) for x in train.name_length]
test[!, :nLength_group] = [name_length_group(x) for x in test.name_length]
418-element Vector{String}:
 "short"
 "medium"
 "medium"
 "short"
 "good"
 "medium"
 "short"
 "medium"
 "good"
 "medium"
 ⋮
 "medium"
 "medium"
 "long"
 "medium"
 "short"
 "medium"
 "medium"
 "short"
 "medium"

title

Getting the title of each name as a new feature.

train[!, :title] = [split(i, '.')[1] for i in train.Name]
train[!, :title] = [split(i, ',')[2] for i in train.title]
888-element Vector{SubString{String}}:
 " Mr"
 " Mrs"
 " Miss"
 " Mrs"
 " Mr"
 " Mr"
 " Mr"
 " Master"
 " Mrs"
 " Mrs"
 ⋮
 " Miss"
 " Mr"
 " Mr"
 " Mrs"
 " Rev"
 " Miss"
 " Miss"
 " Mr"
 " Mr"
println(unique(train.title))
SubString{String}[" Mr", " Mrs", " Miss", " Master", " Don", " Rev", " Dr", " Mme", " Ms", " Major", " Lady", " Sir", " Mlle", " Col", " Capt", " the Countess", " Jonkheer"]
## Let's fix that
train[!, :title] = [strip(x) for x in train.title]
888-element Vector{SubString{String}}:
 "Mr"
 "Mrs"
 "Miss"
 "Mrs"
 "Mr"
 "Mr"
 "Mr"
 "Master"
 "Mrs"
 "Mrs"
 ⋮
 "Miss"
 "Mr"
 "Mr"
 "Mrs"
 "Rev"
 "Miss"
 "Miss"
 "Mr"
 "Mr"

## We can also combine all three lines above into one for the test set
test[!, :title] = [strip(split(split(i, '.')[1], ',')[2]) for i in test.Name]
## However it is important to be able to write readable code, and the line above is not so readable.
418-element Vector{SubString{String}}:
 "Mr"
 "Mrs"
 "Mr"
 "Mr"
 "Mrs"
 "Mr"
 "Miss"
 "Mr"
 "Mrs"
 "Mr"
 ⋮
 "Miss"
 "Miss"
 "Mrs"
 "Miss"
 "Mr"
 "Dona"
 "Mr"
 "Mr"
 "Master"
## Let's replace some of the rare values with the keyword 'rare' and other word choice of our own.
## train Data
train[!, :title] = [replace(i, "Ms" => "Miss") for i in train.title]
train[!, :title] = [replace(i, "Mlle" => "Miss") for i in train.title]
train[!, :title] = [replace(i, "Mme" => "Mrs") for i in train.title]
train[!, :title] = [replace(i, "Dr" => "rare") for i in train.title]
train[!, :title] = [replace(i, "Col" => "rare") for i in train.title]
train[!, :title] = [replace(i, "Major" => "rare") for i in train.title]
train[!, :title] = [replace(i, "Don" => "rare") for i in train.title]
train[!, :title] = [replace(i, "Jonkheer" => "rare") for i in train.title]
train[!, :title] = [replace(i, "Sir" => "rare") for i in train.title]
train[!, :title] = [replace(i, "Lady" => "rare") for i in train.title]
train[!, :title] = [replace(i, "Capt" => "rare") for i in train.title]
train[!, :title] = [replace(i, "the Countess" => "rare") for i in train.title]
train[!, :title] = [replace(i, "Rev" => "rare") for i in train.title]

## Now in programming there is a term called DRY(Don't repeat yourself), whenever we are repeating
## same code over and over again, there should be a light-bulb turning on in our head and make us think
## to code in a way that is not repeating or dull. Let's write a function to do exactly what we
## did in the code above, only not repeating and more interesting.
888-element Vector{String}:
 "Mr"
 "Mrs"
 "Miss"
 "Mrs"
 "Mr"
 "Mr"
 "Mr"
 "Master"
 "Mrs"
 "Mrs"
 ⋮
 "Miss"
 "Mr"
 "Mr"
 "Mrs"
 "rare"
 "Miss"
 "Miss"
 "Mr"
 "Mr"

## we are writing a function that can help us modify title column
"""
    This function helps modifying the title column
"""
function name_converted(feature)
    result = ""
    if feature in ["the Countess", "Capt", "Lady", "Sir", "Jonkheer", "Don", "Major", "Col", "Rev", "Dona", "Dr"]
        result = "rare"
    elseif feature in ["Ms", "Mlle"]
        result = "Miss"
    elseif feature == "Mme"
        result = "Mrs"
    else
        result = feature
    end
    return result
end

test[!, :title] = [name_converted(x) for x in test.title]
train[!, :title] = [name_converted(x) for x in train.title];
println(unique(train.title))
println(unique(test.title))
["Mr", "Mrs", "Miss", "Master", "rare"]
AbstractString["Mr", "Mrs", "Miss", "Master", "rare"]
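
In the same DRY spirit, the grouping can also be expressed as a lookup table. This is only an alternative sketch: `title_map` and `normalize_title` are hypothetical names, and the groupings simply mirror `name_converted` above. A lookup matches the whole title, whereas `replace` matches substrings and could turn a title such as "Dona" into "rarea" through the "Don" rule.

```julia
# Hypothetical lookup table mirroring the groups used in name_converted
title_map = Dict("Ms" => "Miss", "Mlle" => "Miss", "Mme" => "Mrs")
for t in ["the Countess", "Capt", "Lady", "Sir", "Jonkheer", "Don",
          "Major", "Col", "Rev", "Dona", "Dr"]
    title_map[t] = "rare"
end

# Titles that are not in the table are returned unchanged
normalize_title(title) = get(title_map, title, title)

titles = ["Mr", "Mlle", "Dona", "Master"]
cleaned = [normalize_title(t) for t in titles]

# Substring replacement, by contrast, rewrites only part of "Dona"
partial = replace("Dona", "Don" => "rare")
```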

family_size

Creating a new feature called “family_size”.

## Family_size seems like a good feature to create
train[!, :family_size] = train.SibSp + train.Parch .+ 1
test[!, :family_size] = test.SibSp + test.Parch .+ 1
418-element Vector{Int64}:
 1
 2
 1
 1
 3
 1
 1
 3
 1
 3
 ⋮
 3
 1
 2
 1
 1
 1
 1
 1
 3
## bin the family size.
"""
This function groups(loner, small, large) family based on family size
"""
function family_group(size)
    a = ""
    if size <= 1
        a = "loner"
    elseif size <= 4
        a = "small"
    else
        a = "large"
    end
    return a
end
Main.Notebook.family_group

## apply the family_group function in family_size
train[!, :family_group] = [family_group(x) for x in train.family_size]
test[!, :family_group] = [family_group(x) for x in test.family_size];

is_alone

train[!, :is_alone] = [i < 2 ? 1 : 0 for i in train.family_size]
test[!, :is_alone] = [i < 2 ? 1 : 0 for i in test.family_size];

ticket


println(StatsBase.sample(collect(StatsBase.countmap(train.Ticket)), 10))
Pair{String31, Int64}["11813" => 1, "342826" => 1, "349249" => 1, "4135" => 1, "349247" => 1, "36928" => 2, "12460" => 1, "113781" => 4, "A/5. 2151" => 1, "19952" => 1]

I have yet to figure out how best to handle the ticket feature, so any suggestion would be truly appreciated. For now, I will get rid of the ticket feature.

DF.select!(train, DF.Not(:Ticket))
DF.select!(test, DF.Not(:Ticket));

calculated_fare

## Calculating fare based on family size.
train[!, :calculated_fare] = train.Fare ./ train.family_size
test[!, :calculated_fare] = test.Fare ./ test.family_size;

Some people travelled in groups, such as families or friends. It seems that the Fare column records the total fare for the group rather than the fare of each individual passenger, so calculated_fare, the fare divided by the family size, is a handy per-person estimate.
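
As a quick worked example, take the passenger in row 8 of the test table shown in the PassengerId section (Fare 29.0, SibSp 1, Parch 1). The variable names below are for illustration only.

```julia
# Values copied from row 8 of the test table (Caldwell, Mr. Albert Francis)
fare, sibsp, parch = 29.0, 1, 1

# Family size counts the passenger plus siblings/spouses and parents/children
family_size = sibsp + parch + 1

# Per-person fare, which the table reports as 9.66667
per_person_fare = fare / family_size
```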

fare_group

"""
    This function creates a fare group based on the fare provided
"""
function fare_group(fare::Float64)
    a = ""
    if fare <= 4
        a = "Very_low"
    elseif fare <= 10
        a = "low"
    elseif fare <= 20
        a = "mid"
    elseif fare <= 45
        a = "high"
    else
        a = "very_high"
    end
    return a
end

train[!, :fare_group] = [fare_group(x) for x in train.calculated_fare]
test[!, :fare_group] = [fare_group(x) for x in test.calculated_fare];

The fare group is derived from calculated_fare, giving the models a coarser, categorical view of the per-person fare.

PassengerId

The PassengerId column serves only as an identifier in this dataset and carries no predictive information. Let’s drop it.

DF.select!(train, DF.Not(:PassengerId))
DF.select!(test, DF.Not(:PassengerId))
418×17 DataFrame
393 rows omitted
Row Pclass Name Sex Age SibSp Parch Fare Cabin Embarked name_length nLength_group title family_size family_group is_alone calculated_fare fare_group
Int64 String Int64 Float64? Int64 Int64 Float64 String Abstract… Int64 String Abstract… Int64 String Int64 Float64 String
1 3 Kelly, Mr. James 1 34.5 0 0 7.8292 G Q 16 short Mr 1 loner 1 7.8292 low
2 3 Wilkes, Mrs. James (Ellen Needs) 0 47.0 1 0 7.0 G S 32 medium Mrs 2 small 0 3.5 Very_low
3 2 Myles, Mr. Thomas Francis 1 62.0 0 0 9.6875 G Q 25 medium Mr 1 loner 1 9.6875 low
4 3 Wirz, Mr. Albert 1 27.0 0 0 8.6625 G S 16 short Mr 1 loner 1 8.6625 low
5 3 Hirvonen, Mrs. Alexander (Helga E Lindqvist) 0 22.0 1 1 12.2875 G S 44 good Mrs 3 small 0 4.09583 low
6 3 Svensson, Mr. Johan Cervin 1 14.0 0 0 9.225 G S 26 medium Mr 1 loner 1 9.225 low
7 3 Connolly, Miss. Kate 0 30.0 0 0 7.6292 G Q 20 short Miss 1 loner 1 7.6292 low
8 2 Caldwell, Mr. Albert Francis 1 26.0 1 1 29.0 T S 28 medium Mr 3 small 0 9.66667 low
9 3 Abrahim, Mrs. Joseph (Sophie Halaut Easu) 0 18.0 0 0 7.2292 G C 41 good Mrs 1 loner 1 7.2292 low
10 3 Davies, Mr. John Samuel 1 21.0 2 0 24.15 F S 23 medium Mr 3 small 0 8.05 low
11 3 Ilieff, Mr. Ylio 1 missing 0 0 7.8958 G S 16 short Mr 1 loner 1 7.8958 low
12 1 Jones, Mr. Charles Cresson 1 46.0 0 0 26.0 F S 26 medium Mr 1 loner 1 26.0 high
13 1 Snyder, Mrs. John Pillsbury (Nelle Stevenson) 0 23.0 1 0 82.2667 B S 45 good Mrs 2 small 0 41.1334 high
407 2 Ware, Mr. William Jeffery 1 23.0 1 0 10.5 G S 25 medium Mr 2 small 0 5.25 low
408 1 Widener, Mr. George Dunton 1 50.0 1 1 211.5 C C 26 medium Mr 3 small 0 70.5 very_high
409 3 Riordan, Miss. Johanna Hannah"" 0 missing 0 0 7.7208 G Q 31 medium Miss 1 loner 1 7.7208 low
410 3 Peacock, Miss. Treasteall 0 3.0 1 1 13.775 G S 25 medium Miss 3 small 0 4.59167 low
411 3 Naughton, Miss. Hannah 0 missing 0 0 7.75 G Q 22 medium Miss 1 loner 1 7.75 low
412 1 Minahan, Mrs. William Edward (Lillian E Thorpe) 0 37.0 1 0 90.0 C Q 47 long Mrs 2 small 0 45.0 high
413 3 Henriksson, Miss. Jenny Lovisa 0 28.0 0 0 7.775 G S 30 medium Miss 1 loner 1 7.775 low
414 3 Spector, Mr. Woolf 1 missing 0 0 8.05 G S 18 short Mr 1 loner 1 8.05 low
415 1 Oliva y Ocana, Dona. Fermina 0 39.0 0 0 108.9 C C 28 medium rare 1 loner 1 108.9 very_high
416 3 Saether, Mr. Simon Sivertsen 1 38.5 0 0 7.25 G S 28 medium Mr 1 loner 1 7.25 low
417 3 Ware, Mr. Frederick 1 missing 0 0 8.05 G S 19 short Mr 1 loner 1 8.05 low
418 3 Peter, Master. Michael J 1 missing 1 1 22.3583 F C 24 medium Master 3 small 0 7.45277 low

Creating dummy variables

You might be wondering: what is a dummy variable?

Creating dummy variables is an important preprocessing step in machine learning. Categorical variables are often important features, and they can be the difference between a good model and a great model. While working with a dataset, a meaningful value such as “male” or “female” is more intuitive for us than 0s and 1s. However, most machine learning algorithms do not accept categorical values such as “male” or “female” as input. In order to feed the data into a machine learning model, we first have to convert each categorical column into numeric indicator (dummy) columns.
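
The idea of a dummy (one-hot) encoding can be sketched on a plain vector, without DataFrames. The `one_hot` helper and the four-element `sex` vector are hypothetical and only illustrate the transformation.

```julia
# Hypothetical helper: one Bool indicator vector per distinct category
function one_hot(values)
    categories = unique(values)
    return Dict(c => [v == c for v in values] for c in categories)
end

sex = ["male", "female", "female", "male"]
encoded = one_hot(sex)
# encoded["male"] marks the rows where the value is "male", and likewise for "female"
```

Each category becomes its own true/false column, which a model can treat as 1/0.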

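"""
    get_dummies(df, columns)

Return a copy of `df` in which every column listed in `columns` is replaced by
one Bool indicator column per distinct non-missing value (one-hot encoding).
"""
function get_dummies(df, columns)
    result_df = copy(df)

    for col in columns
        unique_vals = unique(skipmissing(result_df[!, col]))

        # One `source => function => destination` transform per category
        dummy_transforms = [col => DF.ByRow(isequal(val)) => Symbol(col, "_", val) for val in unique_vals]

        DF.transform!(result_df, dummy_transforms...)
        DF.select!(result_df, DF.Not(col))
    end

    return result_df
end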

dummy_cols = [:title, :Pclass, :Cabin, :Embarked, :nLength_group, :family_group, :fare_group]
train = get_dummies(train, dummy_cols)
test = get_dummies(test, dummy_cols);
cols_to_drop = [:family_size, :Name, :Fare, :name_length]
DF.select!(train, DF.Not(cols_to_drop))
DF.select!(test, DF.Not(cols_to_drop))
418×38 DataFrame
393 rows omitted
Row Sex Age SibSp Parch is_alone calculated_fare title_Mr title_Mrs title_Miss title_Master title_rare Pclass_3 Pclass_2 Pclass_1 Cabin_G Cabin_T Cabin_F Cabin_B Cabin_E Cabin_C Cabin_A Cabin_D Cabin_N Embarked_Q Embarked_S Embarked_C nLength_group_short nLength_group_medium nLength_group_good nLength_group_long family_group_loner family_group_small family_group_large fare_group_low fare_group_Very_low fare_group_high fare_group_mid fare_group_very_high
Int64 Float64? Int64 Int64 Int64 Float64 Bool Bool Bool Bool Bool Bool Bool Bool Bool Bool Bool Bool Bool Bool Bool Bool Bool Bool Bool Bool Bool Bool Bool Bool Bool Bool Bool Bool Bool Bool Bool Bool
1 1 34.5 0 0 1 7.8292 true false false false false true false false true false false false false false false false false true false false true false false false true false false true false false false false
2 0 47.0 1 0 0 3.5 false true false false false true false false true false false false false false false false false false true false false true false false false true false false true false false false
3 1 62.0 0 0 1 9.6875 true false false false false false true false true false false false false false false false false true false false false true false false true false false true false false false false
4 1 27.0 0 0 1 8.6625 true false false false false true false false true false false false false false false false false false true false true false false false true false false true false false false false
5 0 22.0 1 1 0 4.09583 false true false false false true false false true false false false false false false false false false true false false false true false false true false true false false false false
6 1 14.0 0 0 1 9.225 true false false false false true false false true false false false false false false false false false true false false true false false true false false true false false false false
7 0 30.0 0 0 1 7.6292 false false true false false true false false true false false false false false false false false true false false true false false false true false false true false false false false
8 1 26.0 1 1 0 9.66667 true false false false false false true false false true false false false false false false false false true false false true false false false true false true false false false false
9 0 18.0 0 0 1 7.2292 false true false false false true false false true false false false false false false false false false false true false false true false true false false true false false false false
10 1 21.0 2 0 0 8.05 true false false false false true false false false false true false false false false false false false true false false true false false false true false true false false false false
11 1 missing 0 0 1 7.8958 true false false false false true false false true false false false false false false false false false true false true false false false true false false true false false false false
12 1 46.0 0 0 1 26.0 true false false false false false false true false false true false false false false false false false true false false true false false true false false false false true false false
13 0 23.0 1 0 0 41.1334 false true false false false false false true false false false true false false false false false false true false false false true false false true false false false true false false
407 1 23.0 1 0 0 5.25 true false false false false false true false true false false false false false false false false false true false false true false false false true false true false false false false
408 1 50.0 1 1 0 70.5 true false false false false false false true false false false false false true false false false false false true false true false false false true false false false false false true
409 0 missing 0 0 1 7.7208 false false true false false true false false true false false false false false false false false true false false false true false false true false false true false false false false
410 0 3.0 1 1 0 4.59167 false false true false false true false false true false false false false false false false false false true false false true false false false true false true false false false false
411 0 missing 0 0 1 7.75 false false true false false true false false true false false false false false false false false true false false false true false false true false false true false false false false
412 0 37.0 1 0 0 45.0 false true false false false false false true false false false false false true false false false true false false false false false true false true false false false true false false
413 0 28.0 0 0 1 7.775 false false true false false true false false true false false false false false false false false false true false false true false false true false false true false false false false
414 1 missing 0 0 1 8.05 true false false false false true false false true false false false false false false false false false true false true false false false true false false true false false false false
415 0 39.0 0 0 1 108.9 false false false false true false false true false false false false false true false false false false false true false true false false true false false false false false false true
416 1 38.5 0 0 1 7.25 true false false false false true false false true false false false false false false false false false true false false true false false true false false true false false false false
417 1 missing 0 0 1 8.05 true false false false false true false false true false false false false false false false false false true false true false false false true false false true false false false false
418 1 missing 1 1 0 7.45277 false false false true false true false false false false true false false false false false false false false true false true false false false true false true false false false false

age

As promised, we are going to use a random forest regressor in this section to predict the missing age values. Let’s do it.

train[.!ismissing.(train.Age), :]
711×38 DataFrame
686 rows omitted
Row Sex Age SibSp Parch Survived is_alone calculated_fare title_Mr title_Mrs title_Miss title_Master title_rare Pclass_3 Pclass_1 Pclass_2 Cabin_G Cabin_C Cabin_E Cabin_F Cabin_T Cabin_D Cabin_A Cabin_B Embarked_S Embarked_C Embarked_Q nLength_group_medium nLength_group_long nLength_group_good nLength_group_short family_group_small family_group_loner family_group_large fare_group_Very_low fare_group_high fare_group_low fare_group_very_high fare_group_mid
Int64 Float64? Int64 Int64 Int64 Int64 Float64 Bool Bool Bool Bool Bool Bool Bool Bool Bool Bool Bool Bool Bool Bool Bool Bool Bool Bool Bool Bool Bool Bool Bool Bool Bool Bool Bool Bool Bool Bool Bool
1 1 22.0 1 0 0 0 3.625 true false false false false true false false true false false false false false false false true false false true false false false true false false true false false false false
2 0 38.0 1 0 1 0 35.6416 false true false false false false true false false true false false false false false false false true false false true false false true false false false true false false false
3 0 26.0 0 0 1 1 7.925 false false true false false true false false true false false false false false false false true false false true false false false false true false false false true false false
4 0 35.0 1 0 1 0 26.55 false true false false false false true false false true false false false false false false true false false false false true false true false false false true false false false
5 1 35.0 0 0 0 1 8.05 true false false false false true false false true false false false false false false false true false false true false false false false true false false false true false false
6 1 54.0 0 0 0 1 51.8625 true false false false false false true false false false true false false false false false true false false true false false false false true false false false false true false
7 1 2.0 3 1 0 0 4.215 false false false true false true false false false false false true false false false false true false false true false false false false false true false false true false false
8 0 27.0 0 2 1 0 3.7111 false true false false false true false false true false false false false false false false true false false false true false false true false false true false false false false
9 0 14.0 1 0 1 0 15.0354 false true false false false false false true false false false false true false false false false true false true false false false true false false false false false false true
10 0 4.0 1 1 1 0 5.56667 false false true false false true false false true false false false false false false false true false false true false false false true false false false false true false false
11 0 58.0 0 0 1 1 26.55 false false true false false false true false false true false false false false false false true false false true false false false false true false false true false false false
12 1 20.0 0 0 0 1 8.05 true false false false false true false false true false false false false false false false true false false true false false false false true false false false true false false
13 1 39.0 1 5 0 0 4.46786 true false false false false true false false false false false false true false false false true false false true false false false false false true false false true false false
700 1 19.0 0 0 0 1 7.8958 true false false false false true false false true false false false false false false false true false false false false false true false true false false false true false false
701 0 56.0 0 1 1 0 41.5791 false true false false false false true false false true false false false false false false false true false false false true false true false false false true false false false
702 0 25.0 0 1 1 0 13.0 false true false false false false false true false false false true false false false false true false false false false true false true false false false false false false true
703 1 33.0 0 0 0 1 7.8958 true false false false false true false false true false false false false false false false true false false false false false true false true false false false true false false
704 0 22.0 0 0 0 1 10.5167 false false true false false true false false true false false false false false false false true false false true false false false false true false false false false false true
705 1 28.0 0 0 0 1 10.5 true false false false false false false true true false false false false false false false true false false true false false false false true false false false false false true
706 1 25.0 0 0 0 1 7.05 true false false false false true false false true false false false false false false false true false false true false false false false true false false false true false false
707 0 39.0 0 5 0 0 4.85417 false true false false false true false false false false false false true false false false false false true false false true false false false true false false true false false
708 1 27.0 0 0 0 1 13.0 false false false false true false false true true false false false false false false false true false false true false false false false true false false false false false true
709 0 19.0 0 0 1 1 30.0 false false true false false false true false false false false false false false false true true false false true false false false false true false false true false false false
710 1 26.0 0 0 1 1 30.0 true false false false false false true false false true false false false false false false false true false true false false false false true false false true false false false
711 1 32.0 0 0 0 1 7.75 true false false false false true false false true false false false false false false false false false true false false false true false true false false false true false false
Pkg.add(["DecisionTree", "MLJDecisionTreeInterface"])
import MLJ
import MLJModels
   Resolving package versions...
  No Changes to `C:\Users\Fabrizio\Documents\Projects\Estudio-IA\dia10\julia-titanic-notebook\Project.toml`
  No Changes to `C:\Users\Fabrizio\Documents\Projects\Estudio-IA\dia10\julia-titanic-notebook\Manifest.toml`
age_train_data = train[.!ismissing.(train.Age), :]
y_train = DF.collect(age_train_data[:, :Age])
711-element Vector{Union{Missing, Float64}}:
 22.0
 38.0
 26.0
 35.0
 35.0
 54.0
  2.0
 27.0
 14.0
  4.0
  ⋮
 33.0
 22.0
 28.0
 25.0
 39.0
 27.0
 19.0
 26.0
 32.0
# Load the model type once, at top level. Calling `@load` inside the function and
# constructing the model in the same call raises a world-age MethodError
# ("method too new to be called from this world context").
RandomForestRegressor = MLJ.@load RandomForestRegressor pkg=DecisionTree verbosity=0

function completing_age(df::DF.DataFrame)
    # Rows with a known age train the model; rows with a missing age get predictions
    age_train_data = df[.!ismissing.(df.Age), :]
    age_test_data = df[ismissing.(df.Age), :]

    features_to_exclude = [:Age]

    # Exclude "Survived" when it is present (train only)
    if "Survived" in DF.names(df)
        push!(features_to_exclude, :Survived)
    end

    # Features and target from the rows with a known age
    X_train = DF.select(age_train_data, DF.Not(features_to_exclude))
    y_train = Float64.(age_train_data.Age)  # strip Missing from the element type

    # Features from the rows with a missing age
    X_test = DF.select(age_test_data, DF.Not(features_to_exclude))

    # Fit the model and predict the missing ages
    model = RandomForestRegressor()
    mach = MLJ.machine(model, X_train, y_train)
    MLJ.fit!(mach)
    y_hat = MLJ.predict(mach, X_test)

    # Fill the missing values with the predictions
    df[ismissing.(df.Age), :Age] = y_hat
    df.Age = Float64.(df.Age)  # make sure the column is no longer Union{Missing, Float64}

    return df
end

completing_age(train)
completing_age(test);

Let’s take a look at the histogram of the age column.

# Drop any remaining missing ages: Makie cannot pick bin edges when the data contain missing values
Makie.hist(collect(skipmissing(train.Age)), bins=100)

age_group

We can create a new feature by grouping the “Age” column.

"""
    This function creates a bin for age
"""
function age_group_fun(age::Union{Missing, Real})
    # Pass missing ages through instead of raising a MethodError
    ismissing(age) && return missing
    a = ""
    if age <= 1
        a = "infant"
    elseif age <= 4
        a = "toddler"
    elseif age <= 13
        a = "child"
    elseif age <= 18
        a = "teenager"
    elseif age <= 35
        a = "Young_Adult"
    elseif age <= 45
        a = "adult"
    elseif age <= 55
        a = "middle_aged"
    elseif age <= 65
        a = "senior_citizen"
    else
        a = "old"
    end
    return a
end

train[!, :age_group] = [age_group_fun(x) for x in train.Age]
test[!, :age_group] = [age_group_fun(x) for x in test.Age]

## Creating dummies for "age_group" feature.
train = get_dummies(train, [:age_group])
test = get_dummies(test, [:age_group]);


Feature Selection

Feature selection is an important part of building machine learning models. There are many reasons to use it:

  • Simple models are easier to interpret. People who act on the model’s results have a better understanding of the model.
  • Shorter training times.
  • Enhanced generalisation by reducing overfitting.
  • Easier for software developers to implement in production.
      ◦ As data scientists, we need to remember not to create models with too many variables, since that can overwhelm production engineers.
  • Reduced risk of data errors during model use.
  • Reduced data redundancy.

Part 6: Pre-Modeling Tasks

6a. Separating dependent and independent variables


Before we apply any machine learning models, it is important to separate the dependent and independent variables. The dependent variable, or target, is what we are trying to predict, and the independent variables are the features we use to predict it. We train a machine learning model by giving it both the independent variables and the dependent variable, so we first need to separate them from each other. The code below does just that.

P.S. In our test dataset, we do not have a dependent variable feature. We are to predict that using machine learning models.

# separating our independent and dependent variable
X = DF.select(train, DF.Not(:Survived))
y = train.Survived;

6b. Splitting the training data


There are multiple ways of splitting data. They are…

  • train_test_split.
  • cross_validation.

We have separated the dependent and independent features, and we have separate train and test data. So why do we still have to split our training data? For this competition, we train the machine learning algorithms on only part of the training set, usually about two-thirds of it. Once an algorithm is trained on that two-thirds, we test it on the remaining third, for which we know the true outcomes. If the model performs well, we feed it the competition test data to make predictions and submit them. The code below splits the train data into four parts: X_train, X_test, y_train, and y_test.

  • X_train and y_train are first used to train the algorithm.
  • Then X_test is passed to the trained algorithm to predict outcomes.
  • Once we have the predicted outcomes, we compare them with y_test.

By comparing the model’s predictions with y_test, we can determine whether the algorithm is performing well. For this comparison we use a confusion matrix, which counts the correct and incorrect predictions for each class.
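
The steps above can be sketched in plain Julia. This is an illustration only, not how the notebook (which uses `MLJ.partition`) performs the split: `holdout_indices` and `confusion_counts` are hypothetical helpers, and the label vectors are made up.

```julia
using Random

# Hypothetical helper: shuffle the row indices and cut them at `fraction`
function holdout_indices(n, fraction; rng = Random.default_rng())
    idx = shuffle(rng, 1:n)
    n_train = round(Int, fraction * n)
    return idx[1:n_train], idx[n_train+1:end]
end

# Hypothetical helper: 2×2 confusion matrix for 0/1 labels
# (rows = actual class, columns = predicted class)
function confusion_counts(y_true, y_pred)
    counts = zeros(Int, 2, 2)
    for (t, p) in zip(y_true, y_pred)
        counts[t + 1, p + 1] += 1
    end
    return counts
end

# 888 rows split roughly two-thirds / one-third
train_idx, test_idx = holdout_indices(888, 0.67; rng = MersenneTwister(0))

# Made-up actual and predicted labels
cm = confusion_counts([0, 0, 1, 1, 1], [0, 1, 1, 1, 0])
accuracy = (cm[1, 1] + cm[2, 2]) / sum(cm)
```

The diagonal of the matrix holds the correct predictions, so accuracy is the diagonal sum divided by the total count.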

P.S. When we use cross-validation, it is important to remember not to use X_train, X_test, y_train, and y_test; instead we use X and y. I will discuss this further later.
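
The reason is that cross-validation does its own splitting: every row takes one turn in the validation fold. A minimal sketch of the index bookkeeping, with a hypothetical `kfold_indices` helper, might look like this:

```julia
# Hypothetical helper: k (train, validation) index pairs covering rows 1:n
function kfold_indices(n, k)
    # Fold i holds rows i, i+k, i+2k, ... so every row is validated exactly once
    folds = [collect(i:k:n) for i in 1:k]
    return [(setdiff(1:n, fold), fold) for fold in folds]
end

splits = kfold_indices(10, 5)
```

Each pair in `splits` plays the role of one train/test split, which is why the full X and y are passed in rather than an already reduced X_train.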

using Random
Random.seed!(0)

# Convert the target's data type for better integration with the models
y = MLJ.coerce(y, MLJ.Multiclass)
# Convert all features to continuous, e.g. (false, true) -> (0.0, 1.0)
X = MLJ.coerce(X, MLJ.Count => MLJ.Continuous, MLJ.OrderedFactor => MLJ.Continuous)


(X_train, X_test), (y_train, y_test) = MLJ.partition((X, y), 0.67, shuffle=true, multi=true);
size(X_train)
(595, 37)
size(X_test)
(293, 37)

6c. Feature Scaling


Feature scaling is an important concept for machine learning models. A dataset often contains features that vary widely in magnitude and unit. For some machine learning models this is not a problem. For many others it is: any algorithm that relies on the Euclidean distance between two points lets the features with the largest ranges dominate that distance. Let's look again at a sample of the train dataset below.

sample = shuffle(1:DF.nrow(X_train))[1:5] 
X_train[sample, :]
5×37 DataFrame
Row Sex Age SibSp Parch is_alone calculated_fare title_Mr title_Mrs title_Miss title_Master title_rare Pclass_3 Pclass_1 Pclass_2 Cabin_G Cabin_C Cabin_E Cabin_F Cabin_T Cabin_D Cabin_A Cabin_B Embarked_S Embarked_C Embarked_Q nLength_group_medium nLength_group_long nLength_group_good nLength_group_short family_group_small family_group_loner family_group_large fare_group_Very_low fare_group_high fare_group_low fare_group_very_high fare_group_mid
Float64 Float64? Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64
1 1.0 44.0 0.0 0.0 1.0 8.05 1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0
2 0.0 6.0 0.0 1.0 0.0 16.5 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0
3 1.0 9.0 0.0 2.0 0.0 6.84167 0.0 0.0 0.0 1.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0
4 1.0 40.0 0.0 0.0 1.0 7.8958 1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0
5 1.0 28.0 0.0 0.0 1.0 13.5 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0

Here Age and calculated_fare are much larger in magnitude than the other features. This can create problems, because many machine learning models will implicitly give Age and calculated_fare more weight than the other features. Therefore, we need to do feature scaling to get a better result. There are multiple ways to do feature scaling.

  • MinMaxScaler: scales the data using the max and min values so that it fits between 0 and 1.
  • StandardScaler: scales the data so that it has a mean of 0 and a variance of 1.
  • RobustScaler: scales the data similarly to StandardScaler, but uses the median and the interquartile range so as to avoid issues with large outliers.
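The three scalers listed above can be sketched in a few lines of plain Julia. This is only an illustration on made-up fare values, using the `Statistics` standard library rather than any scaler implementation; note that Julia's `std` divides by n - 1 by default, so the numbers may differ slightly from libraries that divide by n.

```julia
using Statistics

fares = [7.25, 8.05, 13.0, 26.55, 71.28]   # made-up fares with one large value

# Min-max scaling: the smallest value maps to 0 and the largest to 1.
minmax_scaled = (fares .- minimum(fares)) ./ (maximum(fares) - minimum(fares))

# Standard scaling: subtract the mean, divide by the standard deviation.
standard_scaled = (fares .- mean(fares)) ./ std(fares)

# Robust scaling: subtract the median, divide by the interquartile range.
q25, q75 = quantile(fares, [0.25, 0.75])
robust_scaled = (fares .- median(fares)) ./ (q75 - q25)

println(round.(minmax_scaled, digits = 3))
println(round.(standard_scaled, digits = 3))
println(round.(robust_scaled, digits = 3))
```

The large fare pulls the mean and standard deviation with it, while the median and interquartile range used by the robust variant are far less affected.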

I will discuss more on that in a different kernel. For now we will use standard scaling, which MLJ provides as Standardizer, to scale our dataset.

P.S. I am showing a sample of both before and after so that you can see how scaling changes the dataset.

Before Scaling

print(DF.names(X_train))
DF.first(X_train, 5)
["Sex", "Age", "SibSp", "Parch", "is_alone", "calculated_fare", "title_Mr", "title_Mrs", "title_Miss", "title_Master"  …  "nLength_group_good", "nLength_group_short", "family_group_small", "family_group_loner", "family_group_large", "fare_group_Very_low", "fare_group_high", "fare_group_low", "fare_group_very_high", "fare_group_mid"]
5×37 DataFrame
Row Sex Age SibSp Parch is_alone calculated_fare title_Mr title_Mrs title_Miss title_Master title_rare Pclass_3 Pclass_1 Pclass_2 Cabin_G Cabin_C Cabin_E Cabin_F Cabin_T Cabin_D Cabin_A Cabin_B Embarked_S Embarked_C Embarked_Q nLength_group_medium nLength_group_long nLength_group_good nLength_group_short family_group_small family_group_loner family_group_large fare_group_Very_low fare_group_high fare_group_low fare_group_very_high fare_group_mid
Float64 Float64? Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64
1 1.0 29.0 0.0 0.0 1.0 8.05 1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0
2 1.0 missing 0.0 0.0 1.0 8.4583 1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0
3 1.0 34.5 0.0 0.0 1.0 6.4375 1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0
4 0.0 31.0 0.0 0.0 1.0 8.6833 0.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0
5 1.0 25.0 0.0 0.0 1.0 7.05 1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 1.0 0.0 0.0
Standardizer = MLJ.@load Standardizer pkg=MLJModels
standardizer = Standardizer()
mach_scaler = MLJ.machine(standardizer, X_train)

MLJ.fit!(mach_scaler)
X_train_scaled = MLJ.transform(mach_scaler, X_train)
X_test_scaled = MLJ.transform(mach_scaler, X_test);
[ Info: For silent loading, specify `verbosity=0`. 
import MLJModels ✔
[ Info: Training machine(Standardizer(features = Symbol[], …), …).

After Scaling

You can see how the features have transformed above.

NOTE: Unlike in the original notebook, the categorical and boolean columns were not transformed in this example, as this does not negatively affect the model.

Part 7: Modeling the Data


In the previous versions of this kernel, I thought about explaining each model before applying it. However, this makes the kernel too lengthy to sit and read in one go. Therefore I have decided to break this kernel down, explain each algorithm in a different kernel, and add the links here. If you would like to review logistic regression, please click here.

Pkg.add("MLJLinearModels")
   Resolving package versions...
  No Changes to `C:\Users\Fabrizio\Documents\Projects\Estudio-IA\dia10\julia-titanic-notebook\Project.toml`
  No Changes to `C:\Users\Fabrizio\Documents\Projects\Estudio-IA\dia10\julia-titanic-notebook\Manifest.toml`

LogisticClassifier = MLJ.@load LogisticClassifier pkg=MLJLinearModels

logreg = LogisticClassifier(penalty=:l1)
mach_logreg = MLJ.machine(logreg, X_train_scaled, y_train)
MLJ.fit!(mach_logreg)

y_prob = MLJ.predict(mach_logreg, X_test_scaled);
# y_prob holds the probabilistic predictions, e.g. P(survived) -> [0.13, 0.4, 0.75]
# MLJ.mode.(y_prob) gives the most likely class for each row -> [0, 0, 1]
y_pred = MLJ.mode.(y_prob)
[ Info: For silent loading, specify `verbosity=0`. 
import MLJLinearModels ✔
Warning: The number and/or types of data arguments do not match what the specified model
supports. Suppress this type check by specifying `scitype_check_level=0`.

Run `@doc MLJLinearModels.LogisticClassifier` to learn more about your model's requirements.

Commonly, but non exclusively, supervised models are constructed using the syntax
`machine(model, X, y)` or `machine(model, X, y, w)` while most other models are
constructed with `machine(model, X)`.  Here `X` are features, `y` a target, and `w`
sample or class weights.

In general, data in `machine(model, data...)` is expected to satisfy

    scitype(data) <: MLJ.fit_data_scitype(model)

In the present case:

scitype(data) = Tuple{ScientificTypesBase.Table{Union{AbstractVector{Union{Missing, ScientificTypesBase.Continuous}}, AbstractVector{ScientificTypesBase.Continuous}}}, AbstractVector{ScientificTypesBase.Multiclass{2}}}

fit_data_scitype(model) = Tuple{ScientificTypesBase.Table{<:AbstractVector{<:ScientificTypesBase.Continuous}}, AbstractVector{<:ScientificTypesBase.Finite}}
@ MLJBase C:\Users\Fabrizio\.julia\packages\MLJBase\7nGJF\src\machines.jl:237
[ Info: Training machine(LogisticClassifier(lambda = 2.220446049250313e-16, …), …).
Info: Solver: MLJLinearModels.ProxGrad
  accel: Bool true
  max_iter: Int64 1000
  tol: Float64 0.0001
  max_inner: Int64 100
  beta: Float64 0.8
  gram: Bool false
Error: Problem fitting the machine machine(LogisticClassifier(lambda = 2.220446049250313e-16, …), …). 
@ MLJBase C:\Users\Fabrizio\.julia\packages\MLJBase\7nGJF\src\machines.jl:694
[ Info: Running type checks... 
[ Info: It seems an upstream node in a learning network is providing data of incompatible scitype. See above. 
MethodError: no method matching fit(::MLJLinearModels.GeneralizedLinearRegression{MLJLinearModels.LogisticLoss, MLJLinearModels.ScaledPenalty{MLJLinearModels.L1Penalty}}, ::Matrix{Union{Missing, Float64}}, ::Vector{Int64}; solver::MLJLinearModels.ProxGrad)
The function `fit` exists, but no method is defined for this combination of argument types.

Closest candidates are:
  fit(::MLJLinearModels.GeneralizedLinearRegression, ::AbstractMatrix{<:Real}, ::AbstractVector{<:Real}; data, solver)
   @ MLJLinearModels C:\Users\Fabrizio\.julia\packages\MLJLinearModels\s9vSj\src\fit\default.jl:36
  fit(::MLJLinearModels.GeneralizedLinearRegression; kwargs...)
   @ MLJLinearModels C:\Users\Fabrizio\.julia\packages\MLJLinearModels\s9vSj\src\fit\default.jl:50

Stacktrace:
 [1] fit(m::MLJLinearModels.LogisticClassifier, verb::Int64, X::DataFrames.DataFrame, y::CategoricalArrays.CategoricalVector{Int64, UInt32, Int64, CategoricalArrays.CategoricalValue{Int64, UInt32}, Union{}})
   @ MLJLinearModels C:\Users\Fabrizio\.julia\packages\MLJLinearModels\s9vSj\src\mlj\interface.jl:74
 [2] fit_only!(mach::MLJBase.Machine{MLJLinearModels.LogisticClassifier, MLJLinearModels.LogisticClassifier, true}; rows::Nothing, verbosity::Int64, force::Bool, composite::Nothing)
   @ MLJBase C:\Users\Fabrizio\.julia\packages\MLJBase\7nGJF\src\machines.jl:692
 [3] fit_only!
   @ C:\Users\Fabrizio\.julia\packages\MLJBase\7nGJF\src\machines.jl:617 [inlined]
 [4] #fit!#63
   @ C:\Users\Fabrizio\.julia\packages\MLJBase\7nGJF\src\machines.jl:789 [inlined]
 [5] fit!(mach::MLJBase.Machine{MLJLinearModels.LogisticClassifier, MLJLinearModels.LogisticClassifier, true})
   @ MLJBase C:\Users\Fabrizio\.julia\packages\MLJBase\7nGJF\src\machines.jl:786
 [6] top-level scope
   @ C:\Users\Fabrizio\Documents\Projects\Estudio-IA\dia10\julia-titanic-notebook\julia-titanic-wokflow.qmd:1853
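The `MethodError` above reports that the model received a `Matrix{Union{Missing, Float64}}`, and the tables shown earlier list `Age` as `Float64?` with `missing` entries, so the fit fails because one feature column still allows missing values. One possible remedy, which is an assumption on my part and not something the notebook does, is to fill the gaps with the median of the observed ages before scaling and fitting. A minimal sketch on a toy column:

```julia
using Statistics

age = [22.0, missing, 38.0, 26.0, missing, 35.0]   # toy column with gaps

# Median of the observed values only.
age_median = median(skipmissing(age))

# Replace each missing entry; the result no longer allows `missing`.
age_filled = coalesce.(age, age_median)

println(age_median)
println(eltype(age_filled))
```

On the real data the same idea would read `X.Age = coalesce.(X.Age, median(skipmissing(X.Age)))`, ideally with the median computed from the training rows only so that nothing leaks from the held-out part.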

Evaluating a classification model

There are multiple ways to evaluate a classification model.

  • Confusion matrix
  • ROC curve
  • AUC (area under the ROC curve)

Confusion Matrix

A confusion matrix is a table that describes the performance of a classification model. It tells us how many cases our model predicted correctly and incorrectly for each outcome class by comparing actual and predicted labels. For this dataset the model is a binary one: we are trying to classify whether a passenger survived or did not survive. We have fit the model using X_train and y_train and stored the predictions for X_test in the variable y_pred. So now we will use a confusion matrix to compare y_test and y_pred. Let's build the confusion matrix.

conf_matrix = MLJ.confusion_matrix(y_pred, y_test)
UndefVarError: `y_pred` not defined in `Main.Notebook`
Suggestion: check for spelling errors or missing imports.
Stacktrace:
 [1] top-level scope
   @ C:\Users\Fabrizio\Documents\Projects\Estudio-IA\dia10\julia-titanic-notebook\julia-titanic-wokflow.qmd:1875

Our y_test has a total of 293 data points: the part of the original train set that we split off in order to evaluate our model. Each number in the matrix represents certain details about our model. MLJ's confusion_matrix puts predictions in rows and ground truth in columns, so if we think about it in terms of rows and columns, we can see that…

  • the first row holds the data points that the model predicted as not-survived.
  • the second row holds the data points that the model predicted as survived.
  • in terms of columns, the first column holds the passengers who actually did not survive.
  • and the second column holds the passengers who actually survived.

Now you can see that predicted not-survived and predicted survived overlap with actual not-survived and actual survived, which gives four cells. Each cell has a more specific name. Let's see what they are:

  • True Positive(TP): values that the model predicted as yes(survived) and are actually yes(survived).
  • True Negative(TN): values that the model predicted as no(not-survived) and are actually no(not-survived).
  • False Positive(or Type I error): values that the model predicted as yes(survived) but are actually no(not-survived).
  • False Negative(or Type II error): values that the model predicted as no(not-survived) but are actually yes(survived).

For this dataset, whenever the model predicts yes, it is predicting that the passenger survived; whenever it predicts no, it is predicting that the passenger did not survive. Let's determine the value of each of the terms above.

  • True Positive(TP):85
  • True Negative(TN):158
  • False Positive(FP):28
  • False Negative(FN):22
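As a sketch of how such counts arise, the following tallies the four cells from two short label vectors. The vectors are made up for illustration (1 = survived, 0 = not survived) and are not the notebook's `y_test` and `y_pred`.

```julia
# Made-up labels: 1 = survived, 0 = not survived.
actual    = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]

# Each cell counts the rows where a predicted class meets an actual class.
tp = count((predicted .== 1) .& (actual .== 1))   # predicted yes, actually yes
tn = count((predicted .== 0) .& (actual .== 0))   # predicted no,  actually no
fp = count((predicted .== 1) .& (actual .== 0))   # predicted yes, actually no
fn = count((predicted .== 0) .& (actual .== 1))   # predicted no,  actually yes

println((tp = tp, tn = tn, fp = fp, fn = fn))
```

The four counts always add up to the number of rows, which is a quick sanity check on any confusion matrix.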

From these four counts, we can compute many other rates that are used to evaluate a binary classifier.

Accuracy:

Accuracy is the measure of how often the model is correct.

  • (TP + TN)/total = (85+158)/293 = 0.829

We can also calculate the accuracy score using MLJ.

MLJ.accuracy(y_pred, y_test)
UndefVarError: `y_pred` not defined in `Main.Notebook`
Suggestion: check for spelling errors or missing imports.
Stacktrace:
 [1] top-level scope
   @ C:\Users\Fabrizio\Documents\Projects\Estudio-IA\dia10\julia-titanic-notebook\julia-titanic-wokflow.qmd:1914

Misclassification Rate: Misclassification Rate is the measure of how often the model is wrong.

  • Misclassification Rate and Accuracy are complements of each other.
  • Misclassification Rate is equivalent to 1 minus Accuracy.
  • Misclassification Rate is also known as “Error Rate”.

(FP + FN)/Total = (28+22)/293 = 0.17

True Positive Rate/Recall/Sensitivity: How often the model predicts yes(survived) when it’s actually yes(survived)?

TP/(TP+FN) = 85/(85+22) = 0.794392523364486

MLJ.recall(y_pred, y_test)
UndefVarError: `y_pred` not defined in `Main.Notebook`
Suggestion: check for spelling errors or missing imports.
Stacktrace:
 [1] top-level scope
   @ C:\Users\Fabrizio\Documents\Projects\Estudio-IA\dia10\julia-titanic-notebook\julia-titanic-wokflow.qmd:1930

False Positive Rate: How often the model predicts yes(survived) when it’s actually no(not-survived)?

FP/(FP+TN) = 28/(28+158) = 0.1505376344

True Negative Rate/Specificity: How often the model predicts no(not-survived) when it’s actually no(not-survived)?

  • True Negative Rate is equivalent to 1 minus False Positive Rate.

TN/(TN+FP) = 158/(158+28) = 0.84946236559

Precision: How often is the model correct when it predicts yes(survived)?

TP/(TP+FP) = 85/(85+28) = 0.75221238938
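The rates worked out by hand above can be reproduced from the four counts alone. The sketch below uses the counts quoted in the text; the variable names are my own and nothing here calls MLJ.

```julia
# Counts quoted in the text above.
TP, TN, FP, FN = 85, 158, 28, 22
total = TP + TN + FP + FN

accuracy        = (TP + TN) / total
error_rate      = (FP + FN) / total   # 1 - accuracy
recall          = TP / (TP + FN)      # true positive rate, sensitivity
fpr             = FP / (FP + TN)      # false positive rate
specificity     = TN / (TN + FP)      # true negative rate, 1 - fpr
precision_score = TP / (TP + FP)

for (name, value) in [("accuracy", accuracy), ("error rate", error_rate),
                      ("recall", recall), ("false positive rate", fpr),
                      ("specificity", specificity), ("precision", precision_score)]
    println(rpad(name, 22), round(value, digits = 3))
end
```

Keeping the formulas side by side makes the two complements easy to see: accuracy and error rate sum to 1, and so do specificity and the false positive rate.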

MLJ.ppv(y_pred, y_test)  # aka precision
UndefVarError: `y_pred` not defined in `Main.Notebook`
Suggestion: check for spelling errors or missing imports.
Stacktrace:
 [1] top-level scope
   @ C:\Users\Fabrizio\Documents\Projects\Estudio-IA\dia10\julia-titanic-notebook\julia-titanic-wokflow.qmd:1948
function simple_classification_report(y_true, y_pred)
    println("Classification Report:")
    println("======================")
    
    # Main metrics
    println("Accuracy: ", round(MLJ.accuracy(y_pred, y_true), digits=4))
    println("Balanced Accuracy: ", round(MLJ.balanced_accuracy(y_pred, y_true), digits=4))
    println("F1 Score: ", round(MLJ.f1score(y_pred, y_true), digits=4))
    println("  Precision: ", round(MLJ.positive_predictive_value(y_pred, y_true), digits=4))
    println("  Recall:    ", round(MLJ.true_positive_rate(y_pred, y_true), digits=4))
end

# Use the function
simple_classification_report(y_test, y_pred)
UndefVarError: `y_pred` not defined in `Main.Notebook`
Suggestion: check for spelling errors or missing imports.
Stacktrace:
 [1] top-level scope
   @ C:\Users\Fabrizio\Documents\Projects\Estudio-IA\dia10\julia-titanic-notebook\julia-titanic-wokflow.qmd:1965

We have our confusion matrix. How about we give it a little more character?

fig = Makie.Figure()
ax = Makie.Axis(fig[1, 1], 
    title = "Confusion Matrix",  
    xticks=((1:2),["Actual 0", "Actual 1"]), 
    yticks=((1:2), ["Predicted 0", "Predicted 1"]),
)

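# heatmap! draws z[i, j] at (x[i], y[j]); conf_matrix.mat has predictions in rows,
# so transpose it to put actual classes on x and predicted classes on y (matching the text labels below)
hm = Makie.heatmap!(ax, (1:2), (1:2), permutedims(conf_matrix.mat), colormap="Blues")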

for i in 1:2
    for j in 1:2
        text_val = conf_matrix.mat[j, i]
        
        Makie.text!(ax, i, j, 
            text = string(round(text_val, digits=2)),
            color = abs(text_val) > 100 ? :white : :black,
            align = (:center, :center)
        )
    end
end

Makie.Colorbar(fig[1, 2], hm, 
    label = "Counts",
)

fig
UndefVarError: `conf_matrix` not defined in `Main.Notebook`
Suggestion: check for spelling errors or missing imports.
Stacktrace:
 [1] top-level scope
   @ C:\Users\Fabrizio\Documents\Projects\Estudio-IA\dia10\julia-titanic-notebook\julia-titanic-wokflow.qmd:1978

AUC & ROC Curve

fig = Makie.Figure()
ax = Makie.Axis(fig[1, 1],
    title = "ROC Curve",
    xlabel = "False Positive Rate",
    ylabel = "True Positive Rate",
   
)
fpr, tpr, _ = MLJ.roc_curve(y_prob, y_test)

Makie.lines!(ax, fpr, tpr,
    label = "ROC Curve",
    linewidth=4
)
Makie.ablines!(ax, 0, 1,
    color = :black,
    linestyle=:dash,
)
fig
UndefVarError: `y_prob` not defined in `Main.Notebook`
Suggestion: check for spelling errors or missing imports.
Stacktrace:
 [1] top-level scope
   @ C:\Users\Fabrizio\Documents\Projects\Estudio-IA\dia10\julia-titanic-notebook\julia-titanic-wokflow.qmd:2009

AUC:

roc_auc = MLJ.auc(y_prob, y_test)
UndefVarError: `y_prob` not defined in `Main.Notebook`
Suggestion: check for spelling errors or missing imports.
Stacktrace:
 [1] top-level scope
   @ C:\Users\Fabrizio\Documents\Projects\Estudio-IA\dia10\julia-titanic-notebook\julia-titanic-wokflow.qmd:2024
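Since the cells above did not produce output, here is a minimal, library-free sketch of what a ROC curve and its AUC compute. The scores and labels are made up, and the helpers `roc_points` and `trapezoid_auc` are hypothetical names of my own, not MLJ functions.

```julia
# Made-up scores (predicted probability of class 1) and true labels.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
labels = [1, 1, 0, 1, 0, 0]

# Sweep thresholds from high to low; predict 1 whenever score >= threshold.
function roc_points(scores, labels)
    n_pos = count(==(1), labels)
    n_neg = length(labels) - n_pos
    fprs, tprs = [0.0], [0.0]
    for t in sort(unique(scores), rev = true)
        pred = scores .>= t
        push!(tprs, count(pred .& (labels .== 1)) / n_pos)
        push!(fprs, count(pred .& (labels .== 0)) / n_neg)
    end
    return fprs, tprs
end

# Area under the curve by the trapezoidal rule over consecutive points.
function trapezoid_auc(fprs, tprs)
    return sum((fprs[i+1] - fprs[i]) * (tprs[i+1] + tprs[i]) / 2 for i in 1:length(fprs)-1)
end

fprs, tprs = roc_points(scores, labels)
auc_value = trapezoid_auc(fprs, tprs)
println("AUC = ", round(auc_value, digits = 3))
```

An AUC of 1 means every survivor is scored above every non-survivor, while 0.5 corresponds to the dashed diagonal drawn in the plot code above.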

Using Cross-validation:

Pros:

  • Helps reduce the variance of the performance estimate.
  • Gives a better idea of how well the model predicts unseen data.
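To show what cross-validation does with the rows, here is a minimal sketch of plain k-fold index generation. The helper `kfold_indices` is hypothetical, and unlike the `StratifiedCV` strategy used in this notebook it does not balance the classes across folds.

```julia
using Random

# Hypothetical helper: deal shuffled row indices into k folds of near-equal size.
function kfold_indices(n::Integer, k::Integer; rng = Random.MersenneTwister(0))
    idx = Random.shuffle(rng, 1:n)
    return [idx[i:k:n] for i in 1:k]
end

folds = kfold_indices(888, 10)

for (i, test_idx) in enumerate(folds)
    # Training rows are everything outside the current fold.
    train_idx = reduce(vcat, folds[setdiff(1:length(folds), i)])
    println("fold ", i, ": ", length(train_idx), " train rows, ", length(test_idx), " test rows")
end
```

Each row is used for testing exactly once and for training in the other nine rounds, and the final score is the average over the ten folds.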
mach_standarizer = MLJ.machine(standardizer, X)
MLJ.fit!(mach_standarizer)
X = MLJ.transform(mach_standarizer, X)
[ Info: Training machine(Standardizer(features = Symbol[], …), …).
888×37 DataFrame
863 rows omitted
Row Sex Age SibSp Parch is_alone calculated_fare title_Mr title_Mrs title_Miss title_Master title_rare Pclass_3 Pclass_1 Pclass_2 Cabin_G Cabin_C Cabin_E Cabin_F Cabin_T Cabin_D Cabin_A Cabin_B Embarked_S Embarked_C Embarked_Q nLength_group_medium nLength_group_long nLength_group_good nLength_group_short family_group_small family_group_loner family_group_large fare_group_Very_low fare_group_high fare_group_low fare_group_very_high fare_group_mid
Float64 Float64? Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64 Float64
1 0.737384 22.0 0.430385 -0.473087 -1.2304 -0.573109 0.850562 -0.406409 -0.510949 -0.217064 -0.162971 0.89869 -0.561427 -0.510949 0.955387 -0.331476 -0.205441 -0.41016 -0.251797 -0.196349 -0.199419 -0.249261 0.615187 -0.481001 -0.307957 0.856474 -0.256811 -0.312316 -0.611712 1.43152 -1.2304 -0.273817 4.54261 -0.457791 -1.06218 -0.290093 -0.455992
2 -1.35462 38.0 0.430385 -0.473087 -1.2304 0.657024 -1.17437 2.45781 -0.510949 -0.217064 -0.162971 -1.11148 1.77917 -0.510949 -1.04552 3.01342 -0.205441 -0.41016 -0.251797 -0.196349 -0.199419 -0.249261 -1.62369 2.07666 -0.307957 -1.16626 3.88952 -0.312316 -0.611712 1.43152 -1.2304 -0.273817 -0.21989 2.18194 -1.06218 -0.290093 -0.455992
3 -1.35462 26.0 -0.475259 -0.473087 0.811831 -0.407896 -1.17437 -0.406409 1.95494 -0.217064 -0.162971 0.89869 -0.561427 -0.510949 0.955387 -0.331476 -0.205441 -0.41016 -0.251797 -0.196349 -0.199419 -0.249261 0.615187 -0.481001 -0.307957 0.856474 -0.256811 -0.312316 -0.611712 -0.697774 0.811831 -0.273817 -0.21989 -0.457791 0.940401 -0.290093 -0.455992
4 -1.35462 35.0 0.430385 -0.473087 -1.2304 0.307708 -1.17437 2.45781 -0.510949 -0.217064 -0.162971 -1.11148 1.77917 -0.510949 -1.04552 3.01342 -0.205441 -0.41016 -0.251797 -0.196349 -0.199419 -0.249261 0.615187 -0.481001 -0.307957 -1.16626 -0.256811 3.19828 -0.611712 1.43152 -1.2304 -0.273817 -0.21989 2.18194 -1.06218 -0.290093 -0.455992
5 0.737384 35.0 -0.475259 -0.473087 0.811831 -0.403093 0.850562 -0.406409 -0.510949 -0.217064 -0.162971 0.89869 -0.561427 -0.510949 0.955387 -0.331476 -0.205441 -0.41016 -0.251797 -0.196349 -0.199419 -0.249261 0.615187 -0.481001 -0.307957 0.856474 -0.256811 -0.312316 -0.611712 -0.697774 0.811831 -0.273817 -0.21989 -0.457791 0.940401 -0.290093 -0.455992
6 0.737384 missing -0.475259 -0.473087 0.811831 -0.387405 0.850562 -0.406409 -0.510949 -0.217064 -0.162971 0.89869 -0.561427 -0.510949 0.955387 -0.331476 -0.205441 -0.41016 -0.251797 -0.196349 -0.199419 -0.249261 -1.62369 -0.481001 3.24355 -1.16626 -0.256811 -0.312316 1.63292 -0.697774 0.811831 -0.273817 -0.21989 -0.457791 0.940401 -0.290093 -0.455992
7 0.737384 54.0 -0.475259 -0.473087 0.811831 1.28026 0.850562 -0.406409 -0.510949 -0.217064 -0.162971 -1.11148 1.77917 -0.510949 -1.04552 -0.331476 4.8621 -0.41016 -0.251797 -0.196349 -0.199419 -0.249261 0.615187 -0.481001 -0.307957 0.856474 -0.256811 -0.312316 -0.611712 -0.697774 0.811831 -0.273817 -0.21989 -0.457791 -1.06218 3.44328 -0.455992
8 0.737384 2.0 2.24167 0.766149 -1.2304 -0.55044 -1.17437 -0.406409 -0.510949 4.60175 -0.162971 0.89869 -0.561427 -0.510949 -1.04552 -0.331476 -0.205441 2.43533 -0.251797 -0.196349 -0.199419 -0.249261 0.615187 -0.481001 -0.307957 0.856474 -0.256811 -0.312316 -0.611712 -0.697774 -1.2304 3.64796 -0.21989 -0.457791 0.940401 -0.290093 -0.455992
9 -1.35462 27.0 -0.475259 2.00539 -1.2304 -0.569801 -1.17437 2.45781 -0.510949 -0.217064 -0.162971 0.89869 -0.561427 -0.510949 0.955387 -0.331476 -0.205441 -0.41016 -0.251797 -0.196349 -0.199419 -0.249261 0.615187 -0.481001 -0.307957 -1.16626 3.88952 -0.312316 -0.611712 1.43152 -1.2304 -0.273817 4.54261 -0.457791 -1.06218 -0.290093 -0.455992
10 -1.35462 14.0 0.430385 -0.473087 -1.2304 -0.134702 -1.17437 2.45781 -0.510949 -0.217064 -0.162971 -1.11148 -0.561427 1.95494 -1.04552 -0.331476 -0.205441 -0.41016 3.96699 -0.196349 -0.199419 -0.249261 -1.62369 2.07666 -0.307957 0.856474 -0.256811 -0.312316 -0.611712 1.43152 -1.2304 -0.273817 -0.21989 -0.457791 -1.06218 -0.290093 2.19055
11 -1.35462 4.0 0.430385 0.766149 -1.2304 -0.498507 -1.17437 -0.406409 1.95494 -0.217064 -0.162971 0.89869 -0.561427 -0.510949 0.955387 -0.331476 -0.205441 -0.41016 -0.251797 -0.196349 -0.199419 -0.249261 0.615187 -0.481001 -0.307957 0.856474 -0.256811 -0.312316 -0.611712 1.43152 -1.2304 -0.273817 -0.21989 -0.457791 0.940401 -0.290093 -0.455992
12 -1.35462 58.0 -0.475259 -0.473087 0.811831 0.307708 -1.17437 -0.406409 1.95494 -0.217064 -0.162971 -1.11148 1.77917 -0.510949 -1.04552 3.01342 -0.205441 -0.41016 -0.251797 -0.196349 -0.199419 -0.249261 0.615187 -0.481001 -0.307957 0.856474 -0.256811 -0.312316 -0.611712 -0.697774 0.811831 -0.273817 -0.21989 2.18194 -1.06218 -0.290093 -0.455992
13 0.737384 20.0 -0.475259 -0.473087 0.811831 -0.403093 0.850562 -0.406409 -0.510949 -0.217064 -0.162971 0.89869 -0.561427 -0.510949 0.955387 -0.331476 -0.205441 -0.41016 -0.251797 -0.196349 -0.199419 -0.249261 0.615187 -0.481001 -0.307957 0.856474 -0.256811 -0.312316 -0.611712 -0.697774 0.811831 -0.273817 -0.21989 -0.457791 0.940401 -0.290093 -0.455992
877 -1.35462 56.0 -0.475259 0.766149 -1.2304 0.885153 -1.17437 2.45781 -0.510949 -0.217064 -0.162971 -1.11148 1.77917 -0.510949 -1.04552 3.01342 -0.205441 -0.41016 -0.251797 -0.196349 -0.199419 -0.249261 -1.62369 2.07666 -0.307957 -1.16626 -0.256811 3.19828 -0.611712 1.43152 -1.2304 -0.273817 -0.21989 2.18194 -1.06218 -0.290093 -0.455992
878 -1.35462 25.0 -0.475259 0.766149 -1.2304 -0.212906 -1.17437 2.45781 -0.510949 -0.217064 -0.162971 -1.11148 -0.561427 1.95494 -1.04552 -0.331476 -0.205441 2.43533 -0.251797 -0.196349 -0.199419 -0.249261 0.615187 -0.481001 -0.307957 -1.16626 -0.256811 3.19828 -0.611712 1.43152 -1.2304 -0.273817 -0.21989 -0.457791 -1.06218 -0.290093 2.19055
879 0.737384 33.0 -0.475259 -0.473087 0.811831 -0.409018 0.850562 -0.406409 -0.510949 -0.217064 -0.162971 0.89869 -0.561427 -0.510949 0.955387 -0.331476 -0.205441 -0.41016 -0.251797 -0.196349 -0.199419 -0.249261 0.615187 -0.481001 -0.307957 -1.16626 -0.256811 -0.312316 1.63292 -0.697774 0.811831 -0.273817 -0.21989 -0.457791 0.940401 -0.290093 -0.455992
880 -1.35462 22.0 -0.475259 -0.473087 0.811831 -0.308318 -1.17437 -0.406409 1.95494 -0.217064 -0.162971 0.89869 -0.561427 -0.510949 0.955387 -0.331476 -0.205441 -0.41016 -0.251797 -0.196349 -0.199419 -0.249261 0.615187 -0.481001 -0.307957 0.856474 -0.256811 -0.312316 -0.611712 -0.697774 0.811831 -0.273817 -0.21989 -0.457791 -1.06218 -0.290093 2.19055
881 0.737384 28.0 -0.475259 -0.473087 0.811831 -0.30896 0.850562 -0.406409 -0.510949 -0.217064 -0.162971 -1.11148 -0.561427 1.95494 0.955387 -0.331476 -0.205441 -0.41016 -0.251797 -0.196349 -0.199419 -0.249261 0.615187 -0.481001 -0.307957 0.856474 -0.256811 -0.312316 -0.611712 -0.697774 0.811831 -0.273817 -0.21989 -0.457791 -1.06218 -0.290093 2.19055
882 0.737384 25.0 -0.475259 -0.473087 0.811831 -0.441515 0.850562 -0.406409 -0.510949 -0.217064 -0.162971 0.89869 -0.561427 -0.510949 0.955387 -0.331476 -0.205441 -0.41016 -0.251797 -0.196349 -0.199419 -0.249261 0.615187 -0.481001 -0.307957 0.856474 -0.256811 -0.312316 -0.611712 -0.697774 0.811831 -0.273817 -0.21989 -0.457791 0.940401 -0.290093 -0.455992
883 -1.35462 39.0 -0.475259 5.72309 -1.2304 -0.525882 -1.17437 2.45781 -0.510949 -0.217064 -0.162971 0.89869 -0.561427 -0.510949 -1.04552 -0.331476 -0.205441 -0.41016 3.96699 -0.196349 -0.199419 -0.249261 -1.62369 -0.481001 3.24355 -1.16626 -0.256811 3.19828 -0.611712 -0.697774 -1.2304 3.64796 -0.21989 -0.457791 0.940401 -0.290093 -0.455992
884 0.737384 27.0 -0.475259 -0.473087 0.811831 -0.212906 -1.17437 -0.406409 -0.510949 -0.217064 6.12914 -1.11148 -0.561427 1.95494 0.955387 -0.331476 -0.205441 -0.41016 -0.251797 -0.196349 -0.199419 -0.249261 0.615187 -0.481001 -0.307957 0.856474 -0.256811 -0.312316 -0.611712 -0.697774 0.811831 -0.273817 -0.21989 -0.457791 -1.06218 -0.290093 2.19055
885 -1.35462 19.0 -0.475259 -0.473087 0.811831 0.440263 -1.17437 -0.406409 1.95494 -0.217064 -0.162971 -1.11148 1.77917 -0.510949 -1.04552 -0.331476 -0.205441 -0.41016 -0.251797 -0.196349 -0.199419 4.00735 0.615187 -0.481001 -0.307957 0.856474 -0.256811 -0.312316 -0.611712 -0.697774 0.811831 -0.273817 -0.21989 2.18194 -1.06218 -0.290093 -0.455992
886 -1.35462 missing 0.430385 2.00539 -1.2304 -0.48714 -1.17437 -0.406409 1.95494 -0.217064 -0.162971 0.89869 -0.561427 -0.510949 -1.04552 -0.331476 -0.205441 2.43533 -0.251797 -0.196349 -0.199419 -0.249261 0.615187 -0.481001 -0.307957 -1.16626 -0.256811 3.19828 -0.611712 1.43152 -1.2304 -0.273817 -0.21989 -0.457791 0.940401 -0.290093 -0.455992
887 0.737384 26.0 -0.475259 -0.473087 0.811831 0.440263 0.850562 -0.406409 -0.510949 -0.217064 -0.162971 -1.11148 1.77917 -0.510949 -1.04552 3.01342 -0.205441 -0.41016 -0.251797 -0.196349 -0.199419 -0.249261 -1.62369 2.07666 -0.307957 0.856474 -0.256811 -0.312316 -0.611712 -0.697774 0.811831 -0.273817 -0.21989 2.18194 -1.06218 -0.290093 -0.455992
888 0.737384 32.0 -0.475259 -0.473087 0.811831 -0.414619 0.850562 -0.406409 -0.510949 -0.217064 -0.162971 0.89869 -0.561427 -0.510949 0.955387 -0.331476 -0.205441 -0.41016 -0.251797 -0.196349 -0.199419 -0.249261 -1.62369 -0.481001 3.24355 -1.16626 -0.256811 -0.312316 1.63292 -0.697774 0.811831 -0.273817 -0.21989 -0.457791 0.940401 -0.290093 -0.455992
cv = MLJ.StratifiedCV(nfolds=10, shuffle=true, rng=0)

logreg = LogisticClassifier(penalty=:l2)
# X was overwritten with its standardized version above, so there is no separate X_scaled
mach_logreg = MLJ.machine(logreg, X, y)
MLJ.fit!(mach_logreg)

evaluation = MLJ.evaluate!(mach_logreg, resampling=cv, verbosity=0, measure=[MLJ.Accuracy()]);

println("Cross-Validation accuracy scores: ", evaluation.per_fold)
println("Mean Cross-Validation accuracy score: ", MLJ.mean(evaluation.per_fold[1]))

Grid Search on Logistic Regression

  • What is grid search?
  • What are the pros and cons?

Grid search is a simple but effective technique in machine learning. The name comes from the fact that we search for the optimal parameter or parameters over a “grid” of candidate values. The parameters we search over are known as hyperparameters: model settings that are fixed before fitting the model and that determine its behavior. For example, when we choose to use linear regression, we may decide to add a penalty to the loss function, such as Ridge or Lasso. These penalties require a specific alpha (the strength of the regularization) to be set beforehand; the higher the value of alpha, the stronger the penalty. In the LogisticClassifier used here, that role is played by the lambda hyperparameter. Grid search finds the best value among a range of values that we provide, and we then use that value to fit the model and get sweet results. It is essential to understand that hyperparameters are different from model outcomes: coefficients, or evaluation metrics such as accuracy score or mean squared error, are outcomes of fitting, not hyperparameters.
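The mechanics of a grid search fit in a few lines. The sketch below is a toy illustration under my own assumptions: synthetic data, a one-feature ridge regression with a closed-form slope, and a single hold-out split instead of cross-validation. It is not the tuning code of this notebook.

```julia
using Random, Statistics

rng = Random.MersenneTwister(30)
x = randn(rng, 200)
y = 2.0 .* x .+ 0.5 .* randn(rng, 200)   # synthetic data with true slope 2

train, valid = 1:150, 151:200            # simple hold-out split

# One-feature ridge regression has a closed form: w = sum(x*y) / (sum(x^2) + lambda).
ridge_slope(x, y, lambda) = sum(x .* y) / (sum(abs2, x) + lambda)

# Validation mean squared error for one candidate value of lambda.
function validation_mse(lambda)
    w = ridge_slope(x[train], y[train], lambda)
    return mean(abs2, y[valid] .- w .* x[valid])
end

grid = exp10.(range(-2, stop = 2, length = 9))   # candidate lambda values
scores = validation_mse.(grid)
best_lambda = grid[argmin(scores)]

println("best lambda on this grid: ", best_lambda)
```

The hyperparameter `lambda` is chosen by us from the grid, whereas the slope `w` is an outcome of fitting, which is the distinction the paragraph above draws.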

This part of the kernel is a work in progress. Please check back again for future updates.

Random.seed!(30)

logreg = LogisticClassifier()
lambda_vals = exp10.(range(log10(0.01), stop=log10(0.1), length=50))
penalties = [:l1, :l2]

ranges = [
    range(logreg, :lambda, values = lambda_vals),
    range(logreg, :penalty, values = penalties)
]


cv = MLJ.StratifiedCV(nfolds=10, shuffle=true, rng=123)

tuned_logreg = MLJ.TunedModel(
    model=logreg,
    tuning=MLJ.Grid(resolution=1),
    resampling=cv,
    range=ranges,
    measure=MLJ.Accuracy(),  
    acceleration=MLJ.CPUThreads(), 
    acceleration_resampling=MLJ.CPUThreads()
)

# Create and train the machine
mach = MLJ.machine(tuned_logreg, X, y)
MLJ.fit!(mach)
Warning: The number and/or types of data arguments do not match what the specified model
supports. Suppress this type check by specifying `scitype_check_level=0`.

Run `@doc MLJTuning.ProbabilisticTunedModel` to learn more about your model's requirements.

Commonly, but non exclusively, supervised models are constructed using the syntax
`machine(model, X, y)` or `machine(model, X, y, w)` while most other models are
constructed with `machine(model, X)`.  Here `X` are features, `y` a target, and `w`
sample or class weights.

In general, data in `machine(model, data...)` is expected to satisfy

    scitype(data) <: MLJ.fit_data_scitype(model)

In the present case:

scitype(data) = Tuple{ScientificTypesBase.Table{Union{AbstractVector{Union{Missing, ScientificTypesBase.Continuous}}, AbstractVector{ScientificTypesBase.Continuous}}}, AbstractVector{ScientificTypesBase.Multiclass{2}}}

fit_data_scitype(model) = Tuple{ScientificTypesBase.Table{<:AbstractVector{<:ScientificTypesBase.Continuous}}, AbstractVector{<:ScientificTypesBase.Finite}}
@ MLJBase C:\Users\Fabrizio\.julia\packages\MLJBase\7nGJF\src\machines.jl:237
[ Info: Training machine(ProbabilisticTunedModel(model = LogisticClassifier(lambda = 2.220446049250313e-16, …), …), …).
[ Info: Attempting to evaluate 100 models.
Warning: The number and/or types of data arguments do not match what the specified model
supports. Suppress this type check by specifying `scitype_check_level=0`.

Run `@doc MLJLinearModels.LogisticClassifier` to learn more about your model's requirements.

Commonly, but non exclusively, supervised models are constructed using the syntax
`machine(model, X, y)` or `machine(model, X, y, w)` while most other models are
constructed with `machine(model, X)`.  Here `X` are features, `y` a target, and `w`
sample or class weights.

In general, data in `machine(model, data...)` is expected to satisfy

    scitype(data) <: MLJ.fit_data_scitype(model)

In the present case:

scitype(data) = Tuple{ScientificTypesBase.Table{Union{AbstractVector{Union{Missing, ScientificTypesBase.Continuous}}, AbstractVector{ScientificTypesBase.Continuous}}}, AbstractVector{ScientificTypesBase.Multiclass{2}}}

fit_data_scitype(model) = Tuple{ScientificTypesBase.Table{<:AbstractVector{<:ScientificTypesBase.Continuous}}, AbstractVector{<:ScientificTypesBase.Finite}}
@ MLJBase C:\Users\Fabrizio\.julia\packages\MLJBase\7nGJF\src\machines.jl:237
Error: Problem fitting the machine machine(LogisticClassifier(lambda = 0.028117686979742307, …), …). 
@ MLJBase C:\Users\Fabrizio\.julia\packages\MLJBase\7nGJF\src\machines.jl:694
[ Info: Running type checks... 
[ Info: It seems an upstream node in a learning network is providing data of incompatible scitype. See above. 
Error: Problem fitting the machine machine(Resampler(model = LogisticClassifier(lambda = 0.028117686979742307, …), …), …). 
@ MLJBase C:\Users\Fabrizio\.julia\packages\MLJBase\7nGJF\src\machines.jl:694
[ Info: Running type checks... 
[ Info: Type checks okay. 
Error: Problem fitting the machine machine(ProbabilisticTunedModel(model = LogisticClassifier(lambda = 2.220446049250313e-16, …), …), …). 
@ MLJBase C:\Users\Fabrizio\.julia\packages\MLJBase\7nGJF\src\machines.jl:694
[ Info: Running type checks... 
[ Info: It seems an upstream node in a learning network is providing data of incompatible scitype. See above. 
MethodError: no method matching fit(::MLJLinearModels.GeneralizedLinearRegression{MLJLinearModels.LogisticLoss, MLJLinearModels.ScaledPenalty{MLJLinearModels.L2Penalty}}, ::Matrix{Union{Missing, Float64}}, ::Vector{Int64}; solver::MLJLinearModels.LBFGS{Optim.Options{Float64, Nothing}, @NamedTuple{}})
The function `fit` exists, but no method is defined for this combination of argument types.

Closest candidates are:
  fit(::MLJLinearModels.GeneralizedLinearRegression, ::AbstractMatrix{<:Real}, ::AbstractVector{<:Real}; data, solver)
   @ MLJLinearModels C:\Users\Fabrizio\.julia\packages\MLJLinearModels\s9vSj\src\fit\default.jl:36
  fit(::MLJLinearModels.GeneralizedLinearRegression; kwargs...)
   @ MLJLinearModels C:\Users\Fabrizio\.julia\packages\MLJLinearModels\s9vSj\src\fit\default.jl:50

best_model = MLJ.fitted_params(mach).best_model
println("Best parameters: ")
println("  lambda = ", best_model.lambda)
println("  penalty = ", best_model.penalty)

MLJ.report(mach).best_history_entry
machine(ProbabilisticTunedModel(model = LogisticClassifier(lambda = 2.220446049250313e-16, …), …), …) has not been trained. Call `fit!` on the machine, or, if you meant to create a learning network `Node`, use the syntax `node(fitted_params, mach::Machine)`. 
Stacktrace:
 [1] fitted_params(mach::MLJBase.Machine{MLJTuning.ProbabilisticTunedModel{MLJTuning.Grid, MLJLinearModels.LogisticClassifier, Nothing}, MLJTuning.ProbabilisticTunedModel{MLJTuning.Grid, MLJLinearModels.LogisticClassifier, Nothing}, false})
   @ MLJBase C:\Users\Fabrizio\.julia\packages\MLJBase\7nGJF\src\machines.jl:829
 [2] top-level scope
   @ C:\Users\Fabrizio\Documents\Projects\Estudio-IA\dia10\julia-titanic-notebook\julia-titanic-wokflow.qmd:2095

Using the best parameters from the grid-search.

#MLJ.predict(mach, X_test) # the mach already has the best model

This concludes the Julia notebook. As you can see, the interface and structure for building models are very similar to Python. The notebook now continues with the original Python examples. Hope you enjoy reading the notebook 😊

Resources:

Under-fitting & Over-fitting:

  • Confusion Matrix

So, we have our first model and its score. But how do we make sure that our model is performing well? Our model may be overfitting or underfitting. For those of you who don’t know what overfitting and underfitting are, let’s find out.

As you can see in the chart above, underfitting is when the model fails to capture important aspects of the data, and therefore introduces more bias and performs poorly. Overfitting, on the other hand, is when the model performs very well on the training data but poorly on the validation or test sets. This situation is also described as having low bias but high variance, and it leads to poor performance as well. Ideally, we want a model that performs well not only on the training data but also on the test data. This is where the bias-variance tradeoff comes in. When a model overfits, meaning it has low bias and high variance, we introduce some bias in exchange for much less variance. One common tactic for this is regularization (Ridge, Lasso, Elastic Net). These models are built to deal with the bias-variance tradeoff. This kernel explains the topic well. The following chart also gives us a mental picture of where we want our models to be.

Ideally, we want to pick a sweet spot where the model performs well on the training set, the validation set, and the test set. As the model gets more complex, bias decreases and variance increases. However, the most critical part is the error rate. We want our models to be at the bottom of that U shape, where the error rate is lowest. That sweet spot is also known as the Optimum Model Complexity (OMC).
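The U-shaped error curve corresponds to a standard textbook identity, given here as background rather than as something derived in this kernel. Assume a regression setting where $y = f(x) + \varepsilon$, with zero-mean noise $\varepsilon$ of variance $\sigma^2$ that is independent of the fitted model $\hat{f}$. The expected squared error at a point $x$ then splits into three parts:

$$
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
= \big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2
+ \mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]
+ \sigma^2
$$

The first term is the squared bias, the second is the variance, and the third is noise that no model can remove. Making the model more complex tends to shrink the first term and grow the second, which is why the total error is lowest somewhere in between.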

Now that we know what we want in terms of under-fitting and over-fitting, let’s talk about how to combat them.

How to combat over-fitting?

  • Simplify the model by using fewer parameters.
  • Simplify the model by changing the hyperparameters.
  • Introduce regularization.
  • Use more training data.
  • Gather more data (and better-quality data).

This part of the kernel is a work in progress. Please check back for future updates.

7b. K-Nearest Neighbor classifier (KNN)


::: {#212 .cell _uuid=‘953bc2c18b5fd93bcd51a42cc04a0539d86d5bac’ execution=‘{“iopub.execute_input”:“2021-06-26T16:35:40.216328Z”,“iopub.status.busy”:“2021-06-26T16:35:40.215853Z”,“iopub.status.idle”:“2021-06-26T16:35:40.416985Z”,“shell.execute_reply”:“2021-06-26T16:35:40.416038Z”,“shell.execute_reply.started”:“2021-06-26T16:35:40.216141Z”}’ execution_count=1}

## Importing the model and the cross-validation helpers.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import StratifiedShuffleSplit, cross_val_score

## Creating the model object.
## With metric='minkowski' and p=2, KNN uses the Euclidean distance.
knn = KNeighborsClassifier(metric='minkowski', p=2)

## 10-split stratified shuffle-split cross-validation.
cv = StratifiedShuffleSplit(n_splits=10, test_size=.25, random_state=2)

accuracies = cross_val_score(knn, X, y, cv=cv, scoring='accuracy')
print("Cross-Validation accuracy scores: {}".format(accuracies))
print("Mean Cross-Validation accuracy score: {}".format(round(accuracies.mean(), 3)))

:::
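To show what the classifier does internally, here is a minimal from-scratch sketch in plain Python. It is not scikit-learn's implementation: the function names (`minkowski`, `knn_predict`) and the six data points are made up for illustration, and neighbors vote with equal weight.

```python
from collections import Counter


def minkowski(a, b, p=2):
    """Minkowski distance between two points; p=2 is the Euclidean distance."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)


def knn_predict(train_X, train_y, query, k=3, p=2):
    """Predict the majority label among the k training points closest to `query`."""
    # Sort training indices by their distance to the query point.
    order = sorted(range(len(train_X)), key=lambda i: minkowski(train_X[i], query, p))
    nearest_labels = [train_y[i] for i in order[:k]]
    # The most common label among the k nearest neighbors wins.
    return Counter(nearest_labels).most_common(1)[0][0]


# Made-up two-feature data: label 0 clusters near the origin, label 1 near (5, 5).
train_X = [(0.0, 0.0), (1.0, 0.5), (0.5, 1.0), (5.0, 5.0), (4.5, 5.5), (5.5, 4.5)]
train_y = [0, 0, 0, 1, 1, 1]

print(knn_predict(train_X, train_y, (0.8, 0.8)))   # query near the first cluster
print(knn_predict(train_X, train_y, (4.8, 5.1)))   # query near the second cluster
```

There is no training step beyond storing the data. All the work happens at prediction time, which is why the choice of `k` and of the distance metric matters so much.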

Manually find the best possible k value for KNN

::: {#214 .cell _uuid=‘9c0f44165e08f63ae5436180c5a7182e6db5c63f’ execution=‘{“iopub.execute_input”:“2021-06-26T16:35:40.418857Z”,“iopub.status.busy”:“2021-06-26T16:35:40.418419Z”,“iopub.status.idle”:“2021-06-26T16:35:46.541601Z”,“shell.execute_reply”:“2021-06-26T16:35:46.540815Z”,“shell.execute_reply.started”:“2021-06-26T16:35:40.418687Z”}’ execution_count=1}

import numpy as np

## Search for an optimal value of k for KNN.
k_range = range(1, 31)
k_scores = []
for k in k_range:
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, X, y, cv=cv, scoring='accuracy')
    k_scores.append(scores.mean())
print("Accuracy scores are: {}\n".format(k_scores))
print("Mean accuracy score: {}".format(np.mean(k_scores)))

:::

::: {#216 .cell _uuid=‘e123680b431ba99d399fa8205c32bcfdc7cabd81’ execution=‘{“iopub.execute_input”:“2021-06-26T16:35:46.543234Z”,“iopub.status.busy”:“2021-06-26T16:35:46.542789Z”,“iopub.status.idle”:“2021-06-26T16:35:46.685143Z”,“shell.execute_reply”:“2021-06-26T16:35:46.684141Z”,“shell.execute_reply.started”:“2021-06-26T16:35:46.543184Z”}’ execution_count=1}

from matplotlib import pyplot as plt
plt.plot(k_range, k_scores)

:::

Grid search on KNN classifier

::: {#218 .cell _uuid=‘507e2a7cdb28a47be45ed247f1343c123a6b592b’ execution=‘{“iopub.execute_input”:“2021-06-26T16:35:46.687026Z”,“iopub.status.busy”:“2021-06-26T16:35:46.686671Z”,“iopub.status.idle”:“2021-06-26T16:35:55.465245Z”,“shell.execute_reply”:“2021-06-26T16:35:55.464452Z”,“shell.execute_reply.started”:“2021-06-26T16:35:46.686956Z”}’ execution_count=1}

from sklearn.model_selection import GridSearchCV
## Trying out multiple values for k.
k_range = range(1, 31)
## Weighting schemes for the neighbors' votes.
weights_options = ['uniform', 'distance']
## The parameter grid to search over.
param = {'n_neighbors': k_range, 'weights': weights_options}
## Using StratifiedShuffleSplit.
cv = StratifiedShuffleSplit(n_splits=10, test_size=.30, random_state=15)
## estimator = KNeighborsClassifier(), param_grid = param; n_jobs=-1 tells scikit-learn to use all available processors.
grid = GridSearchCV(KNeighborsClassifier(), param, cv=cv, verbose=False, n_jobs=-1)
## Fitting the model.
grid.fit(X, y)

:::

::: {#220 .cell _uuid=‘c710770daa6cf327dcc28e18b3ed180fabecd49b’ execution=‘{“iopub.execute_input”:“2021-06-26T16:35:55.466929Z”,“iopub.status.busy”:“2021-06-26T16:35:55.466654Z”,“iopub.status.idle”:“2021-06-26T16:35:55.475348Z”,“shell.execute_reply”:“2021-06-26T16:35:55.474575Z”,“shell.execute_reply.started”:“2021-06-26T16:35:55.466883Z”}’ execution_count=1}

print(grid.best_score_)
print(grid.best_params_)
print(grid.best_estimator_)

:::

Using best estimator from grid search using KNN.

::: {#222 .cell _uuid=‘dd1fbf223c4ec9db65dde4924e2827e46029da1a’ execution=‘{“iopub.execute_input”:“2021-06-26T16:35:55.477181Z”,“iopub.status.busy”:“2021-06-26T16:35:55.476629Z”,“iopub.status.idle”:“2021-06-26T16:35:55.555736Z”,“shell.execute_reply”:“2021-06-26T16:35:55.554788Z”,“shell.execute_reply.started”:“2021-06-26T16:35:55.476983Z”}’ execution_count=1}

### Using the best parameters from the grid search.
### Note: this scores on the same data used for the search, so the result is optimistic.
knn_grid = grid.best_estimator_
knn_grid.score(X, y)

:::

Using RandomizedSearchCV

Randomized search is a close cousin of grid search. It doesn’t always find the best result, but it is fast, because it evaluates only a random sample of the parameter combinations instead of all of them.
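The difference between the two searches can be sketched in plain Python. This is an illustration only: `toy_score` is a hypothetical stand-in for a cross-validated accuracy, and the sample size of 10 is an arbitrary choice for the example.

```python
import itertools
import random

k_values = range(1, 31)
weights_options = ["uniform", "distance"]

# Grid search evaluates every combination: 30 * 2 = 60 candidates.
full_grid = list(itertools.product(k_values, weights_options))

# Randomized search evaluates only a random subset of n_iter combinations.
rng = random.Random(0)            # seeded so the sample is reproducible
n_iter = 10
sampled = rng.sample(full_grid, n_iter)


def toy_score(k, weights):
    """Hypothetical stand-in for a cross-validated accuracy; peaks at k = 12."""
    bonus = 0.01 if weights == "distance" else 0.0
    return 0.80 - 0.001 * (k - 12) ** 2 + bonus


best_grid = max(full_grid, key=lambda params: toy_score(*params))
best_random = max(sampled, key=lambda params: toy_score(*params))

print(len(full_grid), "candidates in the grid,", len(sampled), "sampled")
print("grid search best:  ", best_grid)
print("random search best:", best_random)
```

The randomized result can never beat the full grid on the same candidates, but it costs a fraction of the evaluations. That tradeoff becomes more attractive as the number of hyperparameters grows.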

::: {#224 .cell _uuid=‘e159b267a57d7519fc0ee8b3d1e95b841d3daf60’ execution=‘{“iopub.execute_input”:“2021-06-26T16:35:55.557501Z”,“iopub.status.busy”:“2021-06-26T16:35:55.557097Z”,“iopub.status.idle”:“2021-06-26T16:36:02.332003Z”,“shell.execute_reply”:“2021-06-26T16:36:02.331364Z”,“shell.execute_reply.started”:“2021-06-26T16:35:55.557338Z”}’ execution_count=1}

from sklearn.model_selection import RandomizedSearchCV
## Trying out multiple values for k.
k_range = range(1, 31)
## Weighting schemes for the neighbors' votes.
weights_options = ['uniform', 'distance']
## The parameter space to sample from.
param = {'n_neighbors': k_range, 'weights': weights_options}
## Using StratifiedShuffleSplit.
cv = StratifiedShuffleSplit(n_splits=10, test_size=.30)
## estimator = KNeighborsClassifier(), param_distributions = param; n_jobs=-1 tells scikit-learn to use all available processors.
## For RandomizedSearchCV, n_iter sets how many parameter combinations are sampled.
grid = RandomizedSearchCV(KNeighborsClassifier(), param, cv=cv, verbose=False, n_jobs=-1, n_iter=40)
## Fitting the model.
grid.fit(X, y)

:::

::: {#226 .cell _uuid=‘c58492525dd18659ef9f9c774ee7601a55e96f36’ execution=‘{“iopub.execute_input”:“2021-06-26T16:36:02.333632Z”,“iopub.status.busy”:“2021-06-26T16:36:02.333341Z”,“iopub.status.idle”:“2021-06-26T16:36:02.340211Z”,“shell.execute_reply”:“2021-06-26T16:36:02.338113Z”,“shell.execute_reply.started”:“2021-06-26T16:36:02.333572Z”}’ execution_count=1}

print(grid.best_score_)
print(grid.best_params_)
print(grid.best_estimator_)

:::

::: {#228 .cell _uuid=‘6fb31588585d50de773ba0db6c378363841a5313’ execution=‘{“iopub.execute_input”:“2021-06-26T16:36:02.343117Z”,“iopub.status.busy”:“2021-06-26T16:36:02.34256Z”,“iopub.status.idle”:“2021-06-26T16:36:02.420683Z”,“shell.execute_reply”:“2021-06-26T16:36:02.419712Z”,“shell.execute_reply.started”:“2021-06-26T16:36:02.342922Z”}’ execution_count=1}

### Using the best parameters from the randomized search.
knn_ran_grid = grid.best_estimator_
knn_ran_grid.score(X, y)

:::

Gaussian Naive Bayes


::: {#230 .cell _uuid=‘8b2435030dbef1303bfc2864d227f5918f359330’ execution=‘{“iopub.execute_input”:“2021-06-26T16:36:02.422487Z”,“iopub.status.busy”:“2021-06-26T16:36:02.421997Z”,“iopub.status.idle”:“2021-06-26T16:36:02.433216Z”,“shell.execute_reply”:“2021-06-26T16:36:02.43234Z”,“shell.execute_reply.started”:“2021-06-26T16:36:02.422237Z”}’ execution_count=1}

# Gaussian Naive Bayes
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

gaussian = GaussianNB()
gaussian.fit(X, y)
y_pred = gaussian.predict(X_test)
# accuracy_score expects the true labels first, then the predictions.
gaussian_accy = round(accuracy_score(y_test, y_pred), 3)
print(gaussian_accy)

:::

Support Vector Machines (SVM)


::: {#232 .cell _uuid=‘56895672215b0b6365c6aaa10e446216ef635f53’ execution=‘{“iopub.execute_input”:“2021-06-26T16:36:02.435838Z”,“iopub.status.busy”:“2021-06-26T16:36:02.435282Z”,“iopub.status.idle”:“2021-06-26T16:37:25.882123Z”,“shell.execute_reply”:“2021-06-26T16:37:25.881483Z”,“shell.execute_reply.started”:“2021-06-26T16:36:02.435553Z”}’ execution_count=1}

from sklearn.svm import SVC
Cs = [0.001, 0.01, 0.1, 1,1.5,2,2.5,3,4,5, 10] ## penalty parameter C for the error term.
gammas = [0.0001,0.001, 0.01, 0.1, 1]
param_grid = {'C': Cs, 'gamma' : gammas}
cv = StratifiedShuffleSplit(n_splits=10, test_size=.30, random_state=15)
grid_search = GridSearchCV(SVC(kernel = 'rbf', probability=True), param_grid, cv=cv) ## 'rbf' stands for gaussian kernel
grid_search.fit(X,y)

:::

::: {#234 .cell _uuid=‘4108264ea5d18e3d3fa38a30584a032c734d6d49’ execution=‘{“iopub.execute_input”:“2021-06-26T16:37:25.8839Z”,“iopub.status.busy”:“2021-06-26T16:37:25.883609Z”,“iopub.status.idle”:“2021-06-26T16:37:25.890029Z”,“shell.execute_reply”:“2021-06-26T16:37:25.889244Z”,“shell.execute_reply.started”:“2021-06-26T16:37:25.883852Z”}’ execution_count=1}

print(grid_search.best_score_)
print(grid_search.best_params_)
print(grid_search.best_estimator_)

:::

::: {#236 .cell _uuid=‘db18a3b5475f03b21a039e31e4962c43f7caffdc’ execution=‘{“iopub.execute_input”:“2021-06-26T16:37:25.892123Z”,“iopub.status.busy”:“2021-06-26T16:37:25.891542Z”,“iopub.status.idle”:“2021-06-26T16:37:25.934216Z”,“shell.execute_reply”:“2021-06-26T16:37:25.933352Z”,“shell.execute_reply.started”:“2021-06-26T16:37:25.892073Z”}’ execution_count=1}

# Use the best hyperparameters found to get the score.
svm_grid = grid_search.best_estimator_
svm_grid.score(X,y)

:::

Decision Tree Classifier

A decision tree works by breaking the dataset down into smaller and smaller subsets. It does this by asking questions about the features of the dataset. The idea is to unmix the labels while asking as few questions as necessary. Each question splits the data into further subsets. Once a subgroup contains only one type of label, the tree ends at that node. If you would like a detailed understanding of the decision tree classifier, please take a look at this kernel.
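To make "unmixing the labels" a bit more concrete, here is a minimal sketch that scores two candidate questions with Gini impurity. It is independent of the Titanic data: the toy labels and the helper names `gini` and `weighted_gini` are my own assumptions for illustration, not something the kernel defines.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 0.0 for a pure group, 0.5 for a 50/50 binary mix."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def weighted_gini(left, right):
    """Impurity of a split: each child's impurity weighted by its share of rows."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

# Hypothetical labels: 1 = survived, 0 = did not survive.
parent = [1, 1, 1, 0, 0, 0, 0, 0]

# Two hypothetical questions, each splitting the parent into two subsets.
split_a = ([1, 1, 1, 0], [0, 0, 0, 0])   # separates the labels well
split_b = ([1, 1, 0, 0], [1, 0, 0, 0])   # leaves both subsets mixed

print("parent impurity :", gini(parent))
print("question A      :", weighted_gini(*split_a))
print("question B      :", weighted_gini(*split_b))
```

The question with the lower weighted impurity unmixes the labels better, so a tree built this way would ask question A first.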

::: {#238 .cell _cell_guid=‘38c90de9-d2e9-4341-a378-a854762d8be2’ _uuid=‘18efb62b713591d1512010536ff10d9f6a91ec11’ execution=‘{“iopub.execute_input”:“2021-06-26T16:37:25.936111Z”,“iopub.status.busy”:“2021-06-26T16:37:25.935654Z”,“iopub.status.idle”:“2021-06-26T16:37:57.983942Z”,“shell.execute_reply”:“2021-06-26T16:37:57.983035Z”,“shell.execute_reply.started”:“2021-06-26T16:37:25.935918Z”}’ execution_count=1}

from sklearn.tree import DecisionTreeClassifier
max_depth = range(1,30)
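max_feature = [21,22,23,24,25,26,28,29,30,'sqrt'] ## 'sqrt' replaces 'auto', which meant the same thing for classifiers and was removed in scikit-learn 1.3.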
criterion=["entropy", "gini"]

param = {'max_depth':max_depth,
         'max_features':max_feature,
         'criterion': criterion}
grid = GridSearchCV(DecisionTreeClassifier(),
                                param_grid = param,
                                 verbose=False,
                                 cv=StratifiedKFold(n_splits=20, random_state=15, shuffle=True),
                                n_jobs = -1)
grid.fit(X, y)

:::

::: {#240 .cell _cell_guid=‘b2222e4e-f5f2-4601-b95f-506d7811610a’ _uuid=‘b0fb5055e6b4a7fb69ef44f669c4df693ce46212’ execution=‘{“iopub.execute_input”:“2021-06-26T16:37:57.988346Z”,“iopub.status.busy”:“2021-06-26T16:37:57.988045Z”,“iopub.status.idle”:“2021-06-26T16:37:57.994617Z”,“shell.execute_reply”:“2021-06-26T16:37:57.993662Z”,“shell.execute_reply.started”:“2021-06-26T16:37:57.988287Z”}’ scrolled=‘true’ execution_count=1}

print(grid.best_params_)
print(grid.best_score_)
print(grid.best_estimator_)

:::

::: {#242 .cell _cell_guid=‘d731079a-31b4-429a-8445-48597bb2639d’ _uuid=‘76c26437d374442826ef140574c5c4880ae1e853’ execution=‘{“iopub.execute_input”:“2021-06-26T16:37:57.996876Z”,“iopub.status.busy”:“2021-06-26T16:37:57.996238Z”,“iopub.status.idle”:“2021-06-26T16:37:58.010892Z”,“shell.execute_reply”:“2021-06-26T16:37:58.010194Z”,“shell.execute_reply.started”:“2021-06-26T16:37:57.996695Z”}’ execution_count=1}

dectree_grid = grid.best_estimator_
## Use the best hyperparameters found to get the score.
dectree_grid.score(X,y)

:::

Let’s look at the feature importance from decision tree grid.

## feature importance
feature_importances = pd.DataFrame(dectree_grid.feature_importances_,
                                   index = column_names,
                                    columns=['importance'])
feature_importances.sort_values(by='importance', ascending=False).head(10)

These are the top 10 features that the decision tree found most helpful in classifying the fates of the passengers on the Titanic that night.

7f. Random Forest Classifier

I admire working with decision trees because of the potential and basics they provide towards building a more complex model like Random Forest (RF). RF is an ensemble method (a combination of many decision trees), which is where the “forest” part comes in. One crucial detail about Random Forest is that, while building a forest of decision trees, the RF model takes random subsets of the original dataset (bootstrapped) and random subsets of the variables (features/columns). Using this method, the RF model creates hundreds to thousands of widely varied decision trees (the number can be set manually). This variety makes the RF model more effective and accurate. We then run each test data point through all of these trees and take a vote on the output.
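Here is a rough, pure-Python sketch of that recipe, assuming the simplest possible tree (a single-question stump) so the whole thing fits in a few lines. The toy data and the helpers `fit_stump`, `fit_forest`, and `predict_forest` are hypothetical names made up for this illustration; scikit-learn's `RandomForestClassifier` grows full trees and re-draws the feature subset at every node.

```python
import random
from collections import Counter

def fit_stump(rows, labels, feature_ids):
    """Fit a one-question tree: predict 1 when row[feature] > threshold."""
    best = None  # (training errors, feature index, threshold)
    for f in feature_ids:
        for threshold in sorted({row[f] for row in rows}):
            errors = sum((row[f] > threshold) != (y == 1)
                         for row, y in zip(rows, labels))
            if best is None or errors < best[0]:
                best = (errors, f, threshold)
    return best[1], best[2]

def predict_stump(stump, row):
    feature, threshold = stump
    return 1 if row[feature] > threshold else 0

def fit_forest(rows, labels, n_trees, n_features, rng):
    forest = []
    for _ in range(n_trees):
        # Random subset of the rows, drawn with replacement (bootstrap).
        idx = [rng.randrange(len(rows)) for _ in rows]
        # Random subset of the columns.
        feature_ids = rng.sample(range(len(rows[0])), n_features)
        forest.append(fit_stump([rows[i] for i in idx],
                                [labels[i] for i in idx],
                                feature_ids))
    return forest

def predict_forest(forest, row):
    """Every tree votes; the most common answer wins."""
    votes = Counter(predict_stump(stump, row) for stump in forest)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: 4 numeric features, label driven by feature 0.
rng = random.Random(15)
rows = [[rng.random() for _ in range(4)] for _ in range(40)]
labels = [1 if row[0] > 0.5 else 0 for row in rows]

forest = fit_forest(rows, labels, n_trees=25, n_features=2, rng=rng)
accuracy = sum(predict_forest(forest, r) == y
               for r, y in zip(rows, labels)) / len(rows)
print("training accuracy of the toy forest:", accuracy)
```

An odd number of trees is used so that a two-class vote can never end in a tie.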

from sklearn.model_selection import GridSearchCV, StratifiedKFold, StratifiedShuffleSplit
from sklearn.ensemble import RandomForestClassifier
n_estimators = [140,145,150,155,160]
max_depth = range(1,10)
criterions = ['gini', 'entropy']
cv = StratifiedShuffleSplit(n_splits=10, test_size=.30, random_state=15)

parameters = {'n_estimators':n_estimators,
              'max_depth':max_depth,
              'criterion': criterions
        }
## 'sqrt' replaces 'auto', which meant the same thing for classifiers and was removed in scikit-learn 1.3.
grid = GridSearchCV(estimator=RandomForestClassifier(max_features='sqrt'),
                                 param_grid=parameters,
                                 cv=cv,
                                 n_jobs = -1)
grid.fit(X,y)
print(grid.best_score_)
print(grid.best_params_)
print(grid.best_estimator_)
rf_grid = grid.best_estimator_
rf_grid.score(X,y)

::: {#252 .cell _kg_hide-input=‘true’ execution=‘{“iopub.execute_input”:“2021-06-26T16:39:53.602628Z”,“iopub.status.busy”:“2021-06-26T16:39:53.602028Z”,“iopub.status.idle”:“2021-06-26T16:39:53.613347Z”,“shell.execute_reply”:“2021-06-26T16:39:53.612229Z”,“shell.execute_reply.started”:“2021-06-26T16:39:53.602297Z”}’ execution_count=1}

from sklearn.metrics import classification_report
# Print the classification report for the random forest's predictions on the test set
y_pred = rf_grid.predict(X_test)
print(classification_report(y_test, y_pred, labels=rf_grid.classes_))

:::

Feature Importance

::: {#254 .cell _kg_hide-input=‘true’ execution=‘{“iopub.execute_input”:“2021-06-26T16:39:53.615537Z”,“iopub.status.busy”:“2021-06-26T16:39:53.614947Z”,“iopub.status.idle”:“2021-06-26T16:39:53.637392Z”,“shell.execute_reply”:“2021-06-26T16:39:53.63647Z”,“shell.execute_reply.started”:“2021-06-26T16:39:53.615192Z”}’ execution_count=1}

## feature importance
feature_importances = pd.DataFrame(rf_grid.feature_importances_,
                                   index = column_names,
                                    columns=['importance'])
feature_importances.sort_values(by='importance', ascending=False).head(10)

:::

Why Random Forest?(Pros and Cons)


Introducing Ensemble Learning

In statistics and machine learning, ensemble methods use multiple learning algorithms to obtain better predictive performance than could be obtained from any of the constituent learning algorithms alone.

There are two types of ensemble learning.

Bagging/Averaging Methods

In averaging methods, the driving principle is to build several estimators independently and then average their predictions. On average, the combined estimator is usually better than any single base estimator because its variance is reduced.
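A small simulation can make the variance claim tangible. This is only a sketch under a strong assumption: each base estimator is modelled as the true value plus independent Gaussian noise (the function `noisy_estimate` and its numbers are hypothetical). Real bagged models are trained on overlapping data and are therefore correlated, so the reduction in practice is smaller than what this toy shows.

```python
import random
import statistics

rng = random.Random(0)

def noisy_estimate(true_value=0.6, noise=0.2):
    """One hypothetical base estimator: the truth plus independent Gaussian noise."""
    return rng.gauss(true_value, noise)

# 2000 single estimators versus 2000 ensembles that each average 25 estimators.
singles = [noisy_estimate() for _ in range(2000)]
ensembles = [statistics.mean([noisy_estimate() for _ in range(25)])
             for _ in range(2000)]

single_var = statistics.variance(singles)
ensemble_var = statistics.variance(ensembles)
print("variance of a single estimator :", single_var)
print("variance of the averaged one   :", ensemble_var)
```

Both groups are centred on the same value; only the spread around it shrinks when the predictions are averaged.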

Boosting Methods

The other family of ensemble methods are boosting methods, where base estimators are built sequentially and one tries to reduce the bias of the combined estimator. The motivation is to combine several weak models to produce a powerful ensemble.

Source:GA

Resource: Ensemble methods: bagging, boosting and stacking


7g. Bagging Classifier


Bagging Classifier (Bootstrap Aggregating) is an ensemble method that manipulates the training set by resampling it and running a learning algorithm on each resample. Let’s do a quick review:

  • A bagging classifier uses bootstrapping to create multiple datasets from one original dataset and runs the algorithm on each of them. Here is an image to show how a bootstrapped dataset works.

    Resampling from original dataset to bootstrapped datasets

    Source: https://uc-r.github.io

  • After running a learning algorithm on each of the bootstrapped datasets, all models are combined by averaging their predictions. The test data/new data then go through this combined classifier to predict the output.

Here is an image to make it clear on how bagging works,

Source: https://prachimjoshi.files.wordpress.com

Please check out this kernel if you want to find out more about bagging classifier.
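The two bullet points above can be sketched in a few lines of plain Python. To keep it short I assume a deliberately trivial base model that just predicts the survival rate it saw in its own bootstrapped dataset. The labels and the helpers `bootstrap_sample` and `fit_base_model` are hypothetical; a real bagging classifier would fit a decision tree on each resample instead.

```python
import random

def bootstrap_sample(dataset, rng):
    """Draw len(dataset) rows with replacement: some rows repeat, others are left out."""
    return [rng.choice(dataset) for _ in dataset]

def fit_base_model(labels):
    """Deliberately trivial base model: the survival rate seen in its own sample."""
    return sum(labels) / len(labels)

rng = random.Random(15)

# Hypothetical labels: 1 = survived, 0 = did not survive.
original = [1, 0, 0, 1, 0, 0, 0, 1, 0, 1]

# Step 1: create many bootstrapped datasets from the one original dataset.
bootstrapped = [bootstrap_sample(original, rng) for _ in range(50)]

# Step 2: fit one base model per bootstrapped dataset.
base_models = [fit_base_model(sample) for sample in bootstrapped]

# Step 3: combine the base models by averaging their outputs.
bagged_probability = sum(base_models) / len(base_models)
bagged_label = int(bagged_probability >= 0.5)
print("averaged survival probability:", bagged_probability)
print("bagged prediction            :", bagged_label)
```

Each bootstrapped dataset has the same size as the original, which is why individual rows end up duplicated in some samples and missing from others.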

from sklearn.ensemble import BaggingClassifier
n_estimators = [10,30,50,70,80,150,160, 170,175,180,185]
cv = StratifiedShuffleSplit(n_splits=10, test_size=.30, random_state=15)

parameters = {'n_estimators':n_estimators,
        }
## The base estimator is left at its default, which is a decision tree.
## (The `base_estimator` keyword was renamed to `estimator` in scikit-learn 1.2.)
grid = GridSearchCV(BaggingClassifier(bootstrap_features=False),
                                 param_grid=parameters,
                                 cv=cv,
                                 n_jobs = -1)
grid.fit(X,y)
print(grid.best_score_)
print(grid.best_params_)
print(grid.best_estimator_)
bagging_grid = grid.best_estimator_
bagging_grid.score(X,y)

Why use Bagging? (Pros and cons)

Bagging works best with strong and complex models (for example, fully developed decision trees). However, don’t let that fool you into thinking that bagging overfits the way a single decision tree does. Instead, bagging reduces overfitting: each base estimator is trained on a different bootstrap resample of the training data, and averaging over many such estimators smooths out the noise that any one of them fits, which reduces variance. The downside is that this can lead to a slight increase in bias.

Random Forest VS. Bagging Classifier

If some of you are like me, you may find Random Forest to be similar to Bagging Classifier. However, there is a fundamental difference between the two: Random Forest considers only a random subset of the features when splitting each node. I will elaborate on this in a future update.

7h. AdaBoost Classifier


AdaBoost is another ensemble model and is quite different from Bagging. Let’s point out the core concepts.

AdaBoost combines a lot of “weak learners” (often stumps: trees with only one node and two leaves) to make classifications.

This base model fitting is an iterative process where each stump is chained one after the other; it cannot run in parallel.

Some stumps get more say in the final classification than others. The model uses weights assigned to each data point/row to indicate its “importance.” Samples with a higher weight have a higher influence on the total error of the next model and get more priority. The first stump starts with uniformly distributed weights, which means that, in the beginning, every data point has an equal weight.

Each stump is made by taking the previous stump’s mistakes into account. After each iteration, the weights are recalculated to take the errors/misclassifications of the last stump into consideration.

The final prediction is typically constructed by a weighted vote, where the weight of each base model depends on its training error or misclassification rate.

To illustrate what we have talked about so far let’s look at the following visualization.

Source: Diogo(Medium)

Let’s dive into the nitty-gritty of AdaBoost:


First, we determine the best feature to split the dataset using the Gini index (basics from decision trees). The feature with the lowest Gini index becomes the first stump in the AdaBoost stump chain (the lower the Gini index, the better the labels are unmixed and, therefore, the better the split).


Secondly, we need to determine how much say a stump will have in the final classification and how we can calculate that.

  • We learn how much say a stump has in the final classification by calculating how well it classified the samples (i.e., by calculating its total error).
  • The Total Error for a stump is the sum of the weights associated with the incorrectly classified samples. For example, let’s say we start a stump with 10 data points. The first stump distributes the weight uniformly among all the data points, which means each data point has a weight of 1/10. Let’s say that, once the weight is distributed, we run the model and find 2 incorrect predictions. To calculate the total error, we add up the weights of the misclassified samples. Here we get 1/10 + 1/10 = 2/10 or 1/5. This is our total error. With uniform weights, we can also think of it as the misclassification rate:

\[ \epsilon_t = \frac{\text{misclassifications}_t}{\text{observations}_t} \]

  • Since the weights of all data points add up to 1, the total error will always be between 0 (perfect stump) and 1 (horrible stump).
  • We use the total error to determine the amount of say a stump has in the final classification using the following formula

\[ \alpha_t = \frac{1}{2}\ln \left(\frac{1-\epsilon_t}{\epsilon_t}\right) \quad \text{where } 0 < \epsilon_t < 1 \]

Where \(\epsilon_t\) is the misclassification rate for the current classifier:

\[ \epsilon_t = \frac{\text{misclassifications}_t}{\text{observations}_t} \]

Here…

  • \(\alpha_t\) = Amount of Say
  • \(\epsilon_t\) = Total error

We can draw a graph to determine the amount of say using the value of total error(0 to 1)

Source: Chris McCormick
  • The blue line tells us the amount of say for Total Error(Error rate) between 0 and 1.
  • When the stump does a reasonably good job and the total error is small, the amount of say (alpha) is relatively large and positive.
  • When the stump does an average job (similar to a coin flip, with correct and incorrect predictions split about 50/50), the total error is ~0.5. In this case the amount of say is 0.
  • When the error rate is high, say close to 1, the amount of say is negative, which means that if the stump outputs “survived,” its negative weight turns that vote into “not survived.”

P.S. If the Total Error is exactly 1 or 0, this equation breaks down (it divides by zero or takes the logarithm of zero). A small error term is added to prevent this from happening.
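To check the numbers, here is a minimal sketch that computes the total error for the 10-data-point example and turns it into an amount of say. The function names and the clipping constant `tiny` are my own assumptions for illustration; clipping is just one simple way to add the "small amount of error" mentioned in the P.S.

```python
import math

def total_error(weights, misclassified):
    """Sum of the weights of the misclassified samples."""
    return sum(w for w, wrong in zip(weights, misclassified) if wrong)

def amount_of_say(error, tiny=1e-10):
    """alpha = 0.5 * ln((1 - error) / error), with error kept away from 0 and 1."""
    error = min(max(error, tiny), 1 - tiny)
    return 0.5 * math.log((1 - error) / error)

# The example from the text: 10 data points, uniform weights, 2 mistakes.
weights = [1 / 10] * 10
misclassified = [True, True] + [False] * 8

eps = total_error(weights, misclassified)
alpha = amount_of_say(eps)
print("total error  :", eps)
print("amount of say:", alpha)

# Good stump, coin-flip stump, bad stump, and the two edge cases.
for e in (0.0, 0.1, 0.5, 0.9, 1.0):
    print(e, amount_of_say(e))
```

With a total error of 1/5, the amount of say works out to 0.5 · ln(4), which is roughly 0.69.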


Third, we need to learn how to modify the weights so that the next stump takes the errors of the current stump into account. The pseudocode for calculating the new sample weight is as follows.

\[ \text{New Sample Weight} = \text{Sample Weight} \times e^{\pm\alpha_t} \]

Here the sign in front of \(\alpha_t\) (the amount of say) depends on whether the sample was correctly classified or misclassified by the current stump. We want to increase the sample weight of the misclassified samples (positive sign), hinting the next stump to put more emphasis on those. Conversely, we want to decrease the sample weight of the correctly classified samples (negative sign), hinting the next stump to put less emphasis on those.

The following equation helps us do this calculation.

\[ D_{t+1}(i) = D_t(i) e^{-\alpha_t y_i h_t(x_i)} \]

Here,

  • \(D_{t+1}(i)\) = New Sample Weight.
  • \(D_t(i)\) = Current Sample weight.
  • \(\alpha_t\) = Amount of Say (the alpha value); this coefficient is recomputed in each iteration.
  • \(y_i h_t(x_i)\) = 1 if the stump classified sample \(i\) correctly, -1 if it misclassified it.
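Continuing the 10-data-point example, the sketch below applies this update once. One step here is an assumption that the formula above does not show: after the update I divide by the sum so that the weights add up to 1 again, which is the usual AdaBoost convention. The labels, predictions, and the helper name `update_weights` are hypothetical.

```python
import math

def update_weights(weights, y_true, y_pred, alpha):
    """Apply D_{t+1}(i) = D_t(i) * exp(-alpha * y_i * h_t(x_i)), then renormalize.

    The renormalization is an added convention, not part of the formula in the text.
    """
    raw = [w * math.exp(-alpha * y * h)
           for w, y, h in zip(weights, y_true, y_pred)]
    total = sum(raw)
    return [w / total for w in raw]

# Hypothetical labels in {-1, +1}; the stump gets samples 0 and 5 wrong.
y_true = [1, 1, 1, 1, 1, -1, -1, -1, -1, -1]
y_pred = [-1, 1, 1, 1, 1, 1, -1, -1, -1, -1]

weights = [0.1] * 10                      # uniform starting weights
alpha = 0.5 * math.log((1 - 0.2) / 0.2)   # amount of say for a total error of 2/10

new_weights = update_weights(weights, y_true, y_pred, alpha)
print("misclassified sample:", new_weights[0])
print("correct sample      :", new_weights[1])
```

Before renormalizing, each misclassified weight doubles (0.1 to 0.2) and each correct weight halves (0.1 to 0.05). After renormalizing, the two misclassified samples together carry half of the total weight, so the next stump cannot ignore them.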

Finally, we put together the combined classifier, which is

\[ AdaBoost(X) = sign\left(\sum_{t=1}^T\alpha_t h_t(X)\right) \]

Here,

\(AdaBoost(X)\) is the classification predictions for \(y\) using predictor matrix \(X\)

\(T\) is the number of “weak learners”

\(\alpha_t\) is the contribution weight for weak learner \(t\)

\(h_t(X)\) is the prediction of weak learner \(t\)

and \(y\) is binary with values -1 and 1
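As a quick illustration of the weighted vote, here is a tiny sketch for a single passenger. The amounts of say and the stump outputs are made-up numbers, and resolving an exact tie to +1 is an arbitrary choice of mine.

```python
def adaboost_predict(alphas, stump_predictions):
    """sign(sum_t alpha_t * h_t(x)) for one sample; an exact tie resolves to +1 here."""
    score = sum(a * h for a, h in zip(alphas, stump_predictions))
    return 1 if score >= 0 else -1

# Hypothetical amounts of say for three stumps and their votes (+1 or -1).
alphas = [0.69, 0.20, 0.30]
votes = [1, -1, -1]

print(adaboost_predict(alphas, votes))
```

Two of the three stumps vote -1, but the first stump has more say than the other two combined, so the weighted vote comes out as +1. A plain majority vote would have gone the other way.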

P.S. Since a stump barely captures the essential specifics of the dataset, the model is highly biased in the beginning. However, as the chain of stumps grows, AdaBoost becomes a strong learner by the end of the process and reduces both bias and variance.


from sklearn.ensemble import AdaBoostClassifier
n_estimators = [100,140,145,150,160, 170,175,180,185]
cv = StratifiedShuffleSplit(n_splits=10, test_size=.30, random_state=15)
learning_r = [0.1,1,0.01,0.5]

parameters = {'n_estimators':n_estimators,
              'learning_rate':learning_r
        }
## The base estimator is left at its default, which is a decision tree of depth 1 (a stump).
## (The `base_estimator` keyword was renamed to `estimator` in scikit-learn 1.2.)
grid = GridSearchCV(AdaBoostClassifier(),
                                 param_grid=parameters,
                                 cv=cv,
                                 n_jobs = -1)
grid.fit(X,y)
print(grid.best_score_)
print(grid.best_params_)
print(grid.best_estimator_)
adaBoost_grid = grid.best_estimator_
adaBoost_grid.score(X,y)

Pros and cons of boosting


Pros

  • Achieves higher performance than bagging when hyper-parameters are tuned properly.
  • Can be used for classification and regression equally well.
  • Easily handles mixed data types.
  • Can use “robust” loss functions that make the model resistant to outliers.

Cons

  • Difficult and time consuming to properly tune hyper-parameters.
  • Cannot be parallelized like bagging (scales poorly to huge amounts of data).
  • More risk of overfitting compared to bagging.


7i. Gradient Boosting Classifier


::: {#268 .cell _cell_guid=‘d32d6df9-b8e7-4637-bacc-2baec08547b8’ _uuid=‘fd788c4f4cde834a1329f325f1f59e3f77c37e42’ execution=‘{“iopub.execute_input”:“2021-06-26T16:41:28.360536Z”,“iopub.status.busy”:“2021-06-26T16:41:28.360265Z”,“iopub.status.idle”:“2021-06-26T16:41:28.521396Z”,“shell.execute_reply”:“2021-06-26T16:41:28.520426Z”,“shell.execute_reply.started”:“2021-06-26T16:41:28.360479Z”}’ scrolled=‘true’ execution_count=1}

# Gradient Boosting Classifier
from sklearn.ensemble import GradientBoostingClassifier

gradient_boost = GradientBoostingClassifier()
gradient_boost.fit(X, y)
y_pred = gradient_boost.predict(X_test)
gradient_accy = round(accuracy_score(y_pred, y_test), 3)
print(gradient_accy)

:::

7j. XGBClassifier


::: {#270 .cell _cell_guid=‘5d94cc5b-d8b7-40d3-b264-138539daabfa’ _uuid=‘9d96154d2267ea26a6682a73bd1850026eb1303b’ execution=‘{“iopub.execute_input”:“2021-06-26T16:41:28.523177Z”,“iopub.status.busy”:“2021-06-26T16:41:28.522724Z”,“iopub.status.idle”:“2021-06-26T16:41:28.526955Z”,“shell.execute_reply”:“2021-06-26T16:41:28.525945Z”,“shell.execute_reply.started”:“2021-06-26T16:41:28.522964Z”}’ execution_count=1}

# from xgboost import XGBClassifier
# XGBClassifier = XGBClassifier()
# XGBClassifier.fit(X, y)
# y_pred = XGBClassifier.predict(X_test)
# XGBClassifier_accy = round(accuracy_score(y_pred, y_test), 3)
# print(XGBClassifier_accy)

:::

7k. Extra Trees Classifier


::: {#272 .cell _cell_guid=‘2e567e01-6b5f-4313-84af-cc378c3b709e’ _uuid=‘c9b958e2488adf6f79401c677087e3250d63ac9b’ execution=‘{“iopub.execute_input”:“2021-06-26T16:41:28.528841Z”,“iopub.status.busy”:“2021-06-26T16:41:28.528382Z”,“iopub.status.idle”:“2021-06-26T16:41:28.555697Z”,“shell.execute_reply”:“2021-06-26T16:41:28.554889Z”,“shell.execute_reply.started”:“2021-06-26T16:41:28.528664Z”}’ execution_count=1}

from sklearn.ensemble import ExtraTreesClassifier
ExtraTreesClassifier = ExtraTreesClassifier()
ExtraTreesClassifier.fit(X, y)
y_pred = ExtraTreesClassifier.predict(X_test)
extraTree_accy = round(accuracy_score(y_pred, y_test), 3)
print(extraTree_accy)

:::

7l. Gaussian Process Classifier


::: {#274 .cell _cell_guid=‘23bd5744-e04d-49bb-9d70-7c2a518f76dd’ _uuid=‘57fc008eea2ce1c0b595f888a82ddeaee6ce2177’ execution=‘{“iopub.execute_input”:“2021-06-26T16:41:28.557268Z”,“iopub.status.busy”:“2021-06-26T16:41:28.556845Z”,“iopub.status.idle”:“2021-06-26T16:41:28.863352Z”,“shell.execute_reply”:“2021-06-26T16:41:28.862576Z”,“shell.execute_reply.started”:“2021-06-26T16:41:28.557221Z”}’ execution_count=1}

from sklearn.gaussian_process import GaussianProcessClassifier
GaussianProcessClassifier = GaussianProcessClassifier()
GaussianProcessClassifier.fit(X, y)
y_pred = GaussianProcessClassifier.predict(X_test)
gau_pro_accy = round(accuracy_score(y_pred, y_test), 3)
print(gau_pro_accy)

:::

7m. Voting Classifier


::: {#276 .cell _cell_guid=‘ac208dd3-1045-47bb-9512-de5ecb5c81b0’ _uuid=‘821c74bbf404193219eb91fe53755d669f5a14d1’ execution=‘{“iopub.execute_input”:“2021-06-26T16:41:28.865063Z”,“iopub.status.busy”:“2021-06-26T16:41:28.86463Z”,“iopub.status.idle”:“2021-06-26T16:41:30.314425Z”,“shell.execute_reply”:“2021-06-26T16:41:30.313671Z”,“shell.execute_reply.started”:“2021-06-26T16:41:28.865013Z”}’ execution_count=1}

from sklearn.ensemble import VotingClassifier

voting_classifier = VotingClassifier(estimators=[
    ('lr_grid', logreg_grid),
    ('svc', svm_grid),
    ('random_forest', rf_grid),
    ('gradient_boosting', gradient_boost),
    ('decision_tree_grid',dectree_grid),
    ('knn_classifier', knn_grid),
#     ('XGB_Classifier', XGBClassifier),
    ('bagging_classifier', bagging_grid),
    ('adaBoost_classifier',adaBoost_grid),
    ('ExtraTrees_Classifier', ExtraTreesClassifier),
    ('gaussian_classifier',gaussian),
    ('gaussian_process_classifier', GaussianProcessClassifier)
],voting='hard')

#voting_classifier = voting_classifier.fit(train_x,train_y)
voting_classifier = voting_classifier.fit(X,y)

:::

::: {#278 .cell _cell_guid=‘648ac6a6-2437-490a-bf76-1612a71126e8’ _uuid=‘518a02ae91cc91d618e476d1fc643cd3912ee5fb’ execution=‘{“iopub.execute_input”:“2021-06-26T16:41:30.316454Z”,“iopub.status.busy”:“2021-06-26T16:41:30.316008Z”,“iopub.status.idle”:“2021-06-26T16:41:30.42114Z”,“shell.execute_reply”:“2021-06-26T16:41:30.420152Z”,“shell.execute_reply.started”:“2021-06-26T16:41:30.31627Z”}’ execution_count=1}

y_pred = voting_classifier.predict(X_test)
voting_accy = round(accuracy_score(y_pred, y_test), 3)
print(voting_accy)

:::

::: {#280 .cell _cell_guid=‘277534eb-7ec8-4359-a2f4-30f7f76611b8’ _kg_hide-input=‘true’ _uuid=‘00a9b98fd4e230db427a63596a2747f05b1654c1’ execution=‘{“iopub.execute_input”:“2021-06-26T16:41:30.422908Z”,“iopub.status.busy”:“2021-06-26T16:41:30.422475Z”,“iopub.status.idle”:“2021-06-26T16:41:30.426856Z”,“shell.execute_reply”:“2021-06-26T16:41:30.425882Z”,“shell.execute_reply.started”:“2021-06-26T16:41:30.422736Z”}’ execution_count=1}

#models = pd.DataFrame({
#    'Model': ['Support Vector Machines', 'KNN', 'Logistic Regression',
#              'Random Forest', 'Naive Bayes',
#              'Decision Tree', 'Gradient Boosting Classifier', 'Voting Classifier', 'XGB Classifier','ExtraTrees Classifier','Bagging Classifier'],
#    'Score': [svc_accy, knn_accy, logreg_accy,
#              random_accy, gaussian_accy, dectree_accy,
#               gradient_accy, voting_accy, XGBClassifier_accy, extraTree_accy, bagging_accy]})
#models.sort_values(by='Score', ascending=False)

:::

Part 8: Submit test predictions


::: {#282 .cell _uuid=‘eb0054822f296ba86aa6005b2a5e35fbc1aec88b’ execution=‘{“iopub.execute_input”:“2021-06-26T16:41:30.429099Z”,“iopub.status.busy”:“2021-06-26T16:41:30.42862Z”,“iopub.status.idle”:“2021-06-26T16:41:30.646363Z”,“shell.execute_reply”:“2021-06-26T16:41:30.645616Z”,“shell.execute_reply.started”:“2021-06-26T16:41:30.428903Z”}’ execution_count=1}

all_models = [logreg_grid,
              knn_grid,
              knn_ran_grid,
              svm_grid,
              dectree_grid,
              rf_grid,
              bagging_grid,
              adaBoost_grid,
              voting_classifier]

c = {}
for i in all_models:
    a = i.predict(X_test)
    b = accuracy_score(a, y_test)
    c[i] = b

:::

::: {#284 .cell _cell_guid=‘51368e53-52e4-41cf-9cc9-af6164c9c6f5’ _uuid=‘b947f168f6655c1c6eadaf53f3485d57c0cd74c7’ execution=‘{“iopub.execute_input”:“2021-06-26T16:41:30.648318Z”,“iopub.status.busy”:“2021-06-26T16:41:30.647987Z”,“iopub.status.idle”:“2021-06-26T16:41:32.045557Z”,“shell.execute_reply”:“2021-06-26T16:41:32.044733Z”,“shell.execute_reply.started”:“2021-06-26T16:41:30.648259Z”}’ execution_count=1}

test_prediction = (max(c, key=c.get)).predict(test)
submission = pd.DataFrame({
        "PassengerId": passengerid,
        "Survived": test_prediction
    })

submission.PassengerId = submission.PassengerId.astype(int)
submission.Survived = submission.Survived.astype(int)

submission.to_csv("titanic1_submission.csv", index=False)

:::

<h1>Resources</h1>
<ul>
    <li><b>Statistics</b></li>
    <ul>
        <li><a href="https://statistics.laerd.com/statistical-guides/measures-of-spread-standard-deviation.php">Types of Standard Deviation</a></li>
        <li><a href="https://blog.minitab.com/blog/statistics-and-quality-data-analysis/what-is-a-t-test-and-why-is-it-like-telling-a-kid-to-clean-up-that-mess-in-the-kitchen">What Is a t-test? And Why Is It Like Telling a Kid to Clean Up that Mess in the Kitchen?</a></li>
        <li><a href="https://blog.minitab.com/blog/statistics-and-quality-data-analysis/what-are-t-values-and-p-values-in-statistics">What Are T Values and P Values in Statistics?</a></li>
        <li><a href="https://www.youtube.com/watch?v=E4KCfcVwzyw">What is p-value? How we decide on our confidence level.</a></li>
    </ul>
    <li><b>Writing Pythonic code</b></li>
    <ul>
        <li><a href="https://www.kaggle.com/rtatman/six-steps-to-more-professional-data-science-code">Six steps to more professional data science code</a></li>
        <li><a href="https://www.kaggle.com/jpmiller/creating-a-good-analytics-report">Creating a Good Analytics Report</a></li>
        <li><a href="https://en.wikipedia.org/wiki/Code_smell">Code Smell</a></li>
        <li><a href="https://www.python.org/dev/peps/pep-0008/">PEP 8 -- Style Guide for Python Code</a></li>
        <li><a href="https://gist.github.com/sloria/7001839">The Best of the Best Practices (BOBP) Guide for Python</a></li>
        <li><a href="https://www.python.org/dev/peps/pep-0020/">PEP 20 -- The Zen of Python</a></li>
        <li><a href="https://docs.python-guide.org/">The Hitchhiker's Guide to Python</a></li>
        <li><a href="https://realpython.com/tutorials/best-practices/">Python Best Practice Patterns</a></li>
        <li><a href="http://www.nilunder.com/blog/2013/08/03/pythonic-sensibilities/">Pythonic Sensibilities</a></li>
    </ul>
    <li><b>Why Scikit-Learn?</b></li>
    <ul>
        <li><a href="https://www.oreilly.com/content/intro-to-scikit-learn/">Introduction to Scikit-Learn</a></li>
        <li><a href="https://www.oreilly.com/content/six-reasons-why-i-recommend-scikit-learn/">Six reasons why I recommend scikit-learn</a></li>
        <li><a href="https://hub.packtpub.com/learn-scikit-learn/">Why you should learn Scikit-learn</a></li>
        <li><a href="https://www.kaggle.com/baghern/a-deep-dive-into-sklearn-pipelines">A Deep Dive Into Sklearn Pipelines</a></li>
        <li><a href="https://www.kaggle.com/sermakarevich/sklearn-pipelines-tutorial">Sklearn pipelines tutorial</a></li>
        <li><a href="https://www.kdnuggets.com/2017/12/managing-machine-learning-workflows-scikit-learn-pipelines-part-1.html">Managing Machine Learning workflows with Sklearn pipelines</a></li>
        <li><a href="https://towardsdatascience.com/a-simple-example-of-pipeline-in-machine-learning-with-scikit-learn-e726ffbb6976">A simple example of pipeline in Machine Learning using SKlearn</a></li>
    </ul>
</ul>
<h1>Credits</h1>
<ul>
    <li>To Brandon Foltz for his <a href="https://www.youtube.com/channel/UCFrjdcImgcQVyFbK04MBEhA">YouTube</a> channel and for being an amazing teacher.</li>
    <li>To GA, where I started my data science journey.</li>
    <li>To the Kaggle community for inspiring me over and over again with all the resources I need.</li>
    <li>To the Udemy course "Deployment of Machine Learning". I have used and modified some of the code from this course to help make the learning process intuitive.</li>
</ul>

If you would like to discuss any other projects or just chat about data science topics, I’ll be more than happy to connect with you on:

<ul>
    <li><a href="https://www.linkedin.com/in/masumrumi/"><b>LinkedIn</b></a></li>
    <li><a href="https://github.com/masumrumi"><b>GitHub</b></a></li>
    <li><a href="https://masumrumi.github.io/cv/"><b>masumrumi.github.io/cv/</b></a></li>
    <li><a href="https://www.youtube.com/channel/UC1mPjGyLcZmsMgZ8SJgrfdw"><b>YouTube</b></a></li>
</ul>

This kernel will always be a work in progress. With each update, I will incorporate new data science concepts as I come to understand them. If you have any ideas or suggestions about this notebook, please let me know. Any feedback on further improvements would be genuinely appreciated.

If you have come this far, congratulations!

If this notebook helped you in any way or you liked it, please upvote and/or leave a comment!! :)